1 Introduction

The world is currently facing a significant number of deaths caused mainly by heart attacks. Developing countries, especially in Asian and African regions, fail to save a significant number of lives simply because the severity of the attack is determined too late [1, 2]. Detecting a heart attack at an early stage may significantly help prevent it. The day-to-day practice of medical practitioners has generated a treasure of datasets that can be analyzed to determine the attributes that matter most when diagnosing a heart attack [3]. Unfortunately, these datasets are currently not being used effectively to serve this purpose. The main goal of this research is to use such real-life datasets in a way that helps predict a potential heart attack in time. Various data analysis and data mining techniques are available to serve this purpose [4]. Many people die experiencing symptoms that were previously undiscovered or simply ignored. It is time to predict heart disease before its actual occurrence. There are several main causes of heart disease, including high cholesterol levels, high blood pressure, smoking, use of alcoholic drinks, high blood sugar, lack of physical activity, existing cardiovascular disease (CVD), and a hypertensive heart [5].

In the current era, the importance of data science and Big Data for online shopping, search, and multimedia is hardly less than that of the Internet itself. People want to know more about their surroundings. The most important questions in a typical user’s mind are what is happening, why it happened, and what is likely to happen in the future. An analyst may even want to know what will happen over the years. A user may be interested in a loan offered by a bank, but may first want to analyze the bank’s services and the potential costs and benefits involved in taking the loan. Another user may want to invest money and know the best way to do so. Someone may be unsatisfied with their daily routine and want to optimize it, ideally in a few moments, and may also want to know the trends regarding this matter. Another major application of Big Data is online shopping, where the consumer is always confused about which product to buy from a haystack of brands. Some researchers and Big Data professionals claim that IoT may one day outperform or take over Big Data as the most glorified technology in the world [6]. That may well happen, but one thing to consider is that IoT cannot come alive without Big Data. Gartner defines Big Data as information assets characterized by high volume, high velocity, and high variety that demand cost-effective, innovative forms of information processing to support advanced insight and decision-making [7]. In real life, most data are voluminous. For example, social media, especially Facebook, has 800 million active members; there are around four billion phones, of which more than 25% are smartphones; and there are billions of RFID tags as well.

1.1 Big Data

The term Big Data refers to a collection of large datasets that are impossible to process with traditional computing techniques [7]. In traditional enterprise systems, there is normally a centralized server for the storage and processing of data. Here comes the need for MapReduce, which is basically a programming model used to write applications that can process Big Data on multiple nodes in parallel. MapReduce provides the analytical capability to analyze very large volumes of data of increasing complexity. Volume, variety, and velocity became known as the 3 V’s of Big Data. Oracle later added a fourth V, “Value”, meaning how valuable and important the data are to analyze. The variety of data can be structured, unstructured, or semi-structured. The velocity of data is determined by the incoming data rate, which may be slow or fast. The volume of data, on the other hand, can range from terabytes to exabytes.
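To make the programming model concrete, the following pure-Python sketch mimics the map, shuffle, and reduce phases on a hypothetical word-count task; a real deployment would rely on a framework such as Hadoop, and the function names here are illustrative only.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Mapper: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Group intermediate pairs by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: aggregate all values observed for one key."""
    return key, sum(values)

documents = ["heart attack risk", "heart disease risk factors"]
mapped = chain.from_iterable(map_phase(d) for d in documents)  # runs in parallel in practice
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # e.g. {'heart': 2, 'attack': 1, 'risk': 2, ...}
```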

In the past few years, Big Data analytics has been used by different companies in increasingly meaningful ways. Without analysis, data are pretty much useless. Recently, large companies have understood the worth of data and have tried to give better facilities and services to their clients. For example, Google serves advertisements relevant to a user’s web surfing, Amazon suggests products according to one’s interests, and YouTube’s recommender system displays videos relevant to the user’s previous search history.

Processing Big Data is an exhaustive task, and one may face different challenges during data loading, storage, and processing. The first challenge is the large volume: a large dataset must be stored on disk, so more hardware resources are required because a single machine is not enough. The physical infrastructure should be vertically scalable, allowing more RAM, CPU, and storage to be added, and it should also be horizontally scalable, so that the number of nodes can be increased and the data distributed among them. Other challenges include high throughput, heterogeneity, relating and linking data, and data complexity.

1.2 Machine Learning

Machine learning (ML) is the practical implementation of artificial intelligence, in which intelligent algorithms are programmed to obtain more accurate results and predict the output within an acceptable range [8]. ML algorithms can be divided into supervised and unsupervised learning. In supervised learning, input data along with target values are given to the algorithm, which trains on them and predicts output values with a certain accuracy. The algorithm learns from the given data and can apply the learned model to new datasets; the concept is similar to predictive modeling and data mining, in which we look deep into datasets to find patterns. Unsupervised learning, on the other hand, is used for more complex jobs and does not require training with known outcome data; its target is to group the data into sensible classes [9, 10].
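As a minimal illustration of the two settings, the following scikit-learn sketch fits a supervised classifier on labeled toy data and an unsupervised clustering model on the same data without labels; the estimators and synthetic data are assumptions for illustration, not the models used in this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))             # 100 samples, 4 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # known target values (labels)

# Supervised learning: train on (X, y), then predict labels for new data.
clf = LogisticRegression().fit(X, y)
print(clf.predict(rng.normal(size=(3, 4))))

# Unsupervised learning: no labels given; group samples into sensible classes.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters[:10])
```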

1.3 Deep Learning

The term ‘deep learning’ (DL) also refers to hierarchical or deep-structured learning [11,12,13]. Unlike task-specific methods, DL is a type of ML technique based on learned data representations, and the learning can be supervised, unsupervised, or semi-supervised. DL models are loosely inspired by the workings of biological nervous systems, in particular how information is processed and communicated in them. However, these DL techniques are structurally and functionally different from human brains, which makes them incompatible with neuroscience evidence. DL architectures such as convolutional neural networks (CNN), deep learning networks, recurrent neural networks, and deep belief networks have been employed in various research areas, including human speech recognition, computer vision (CV), audio recognition, natural language processing, machine translation, social network filtering, drug design, bioinformatics, medical image processing, board game programs, and materials inspection. These advanced machine learning models have produced results equal to and, in some scenarios, better than those of humans.

1.4 Applications of Deep Learning

Recently, DL techniques have revitalized neural network models. Researchers have introduced stacked restricted Boltzmann machines and autoencoders, which exhibit remarkable performance in digital image processing. The great advancement in neural network variants has allowed DL techniques to be employed in CV, audio/visual content processing, and many other research areas. RNN techniques work well for processing sequential data; therefore, several RNN variants, such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks, have been developed and employed for sequence-based tasks. These methods show improved performance in numerous areas, such as handwriting recognition, language translation and modeling, and acoustic speech modeling.

1.5 Applications in Healthcare

Because of the promising results of DL approaches in various fields, researchers are now employing these methods in medicine as well [14]. Some of the main research areas of DL in healthcare are the detection of phenotypic patterns from serum uric acid measurements, the determination of physiologic patterns, and the prediction of disease severity [15]. The proposed work aims to detect HF at an early stage by employing the available EHR information of patients.

To understand the associations between various clinical procedures or among Unified Medical Language System (UMLS) concepts, DL approaches [4, 8, 11] are now applied to the textual data of healthcare units by utilizing the Skip-gram method [16, 17]. Skip-gram works by identifying low-dimensional representations of EHR data such as procedure, diagnostic, and medication codes. We employ this concept in our proposed technique to obtain the same kind of data representation. Our work is concerned with temporal data modeling utilizing a CNN for HF prediction at its earliest stage.
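The following sketch illustrates the Skip-gram idea on hypothetical EHR code sequences using gensim (version ≥ 4.0 assumed); the codes, vector size, and window are illustrative assumptions, not the settings used in this work.

```python
from gensim.models import Word2Vec

# Each patient visit is treated as a "sentence" of diagnosis/procedure/medication
# codes; Skip-gram (sg=1) learns a low-dimensional vector per code.
visits = [
    ["ICD9_410", "ICD9_428", "PROC_3722", "MED_ASPIRIN"],   # hypothetical codes
    ["ICD9_401", "ICD9_428", "MED_FUROSEMIDE"],
    ["ICD9_250", "ICD9_401", "MED_METFORMIN"],
]

model = Word2Vec(
    sentences=visits,
    vector_size=32,   # dimensionality of the learned code embeddings
    window=5,         # context window within a visit
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    min_count=1,
    epochs=50,
)

vec = model.wv["ICD9_428"]                        # 32-dimensional representation
print(model.wv.most_similar("ICD9_428", topn=2))  # codes that often co-occur
```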

The main contributions of the proposed method are as follows:

  • The proposed method (named CardioHelp) predicts the probability of the presence of cardiovascular disease in a patient by incorporating a state-of-the-art deep learning algorithm, the convolutional neural network.

  • To the best of our knowledge, this is the first time a deep learning model has been applied in the medical field for predicting coronary heart disease (CHD) using just 14 attributes.

  • We prepared the heart disease dataset, compared the results with state-of-the-art methods, and achieved good results.

The rest of the paper is organized as follows. Section 2 briefly describes the research work already carried out in the prediction of heart diseases. Section 3 elaborates on the proposed framework. Section 4 describes the incorporated dataset and its attributes and provides a detailed discussion of the achieved results. Section 5 concludes the paper.

2 Related Work

A significant amount of research in this field has already been carried out by various researchers. This section gives an overview of the work on the prediction of heart disease with the help of advanced ML algorithms and Big Data technologies. Various pattern recognition and data analysis methods for predictive modeling of cardiovascular diseases have been used by different research groups. State-of-the-art machine learning algorithms such as the K-nearest neighbor algorithm, Naïve Bayes classifier, genetic algorithms, decision trees, ANNs, and deep neural networks have been used to carry out the experimental work. It is noted that, in the majority of these studies, the reported accuracy is higher when more features are used and combinations of the above-mentioned techniques are applied [16].

Let us now briefly review some of the important and relevant work done by various research groups in this area. In a similar context, Qrenawi, Mohammed et al. [18] successfully applied an ontology-based data mining technique on a special dataset of diabetics with cardiovascular disease. This research work was carried out to identify any relationship between the history of type two diabetic patients and the laboratory tests advised by a medical practitioner. A later phase of the work focused on frequent pattern discovery and ontology-based rule induction methods using a data mining algorithm named RMonto. The key outcome of this research is that the usage of an ontology-based technique lessens the number of attributes/properties in the preprocessing phase and aids most of the data mining phases. Additionally, the authors achieved an accuracy of 90%. Nguyen et al. [19] proposed a real-time deep learning framework for heart disease classification in an IoT-based medical environment. The heartbeat signals collected from ECG devices are decomposed into wavelet coefficients using the Wavelet Packet Decomposition (WPD) algorithm, and features are extracted by applying a wavelet-based kernel PCA (WPCA). A deep neural network trained with backpropagation and three hidden layers of 80, 40, and 20 nodes is used to classify the heart disease. In [17], an expert system using deep neural networks is developed by H. Van Pham. The knowledge base is represented using fuzzy rules and updated according to the doctor’s preferences to improve the database and help them make the right decision based on the heart disease risk level.

Various researchers have successfully applied intelligent machine learning algorithms in decision support systems. For instance, we may refer to the work of Isra’a Ahmed et al. [20], who developed an intelligent medical decision support system based on data mining techniques. This work incorporates a total of five data mining algorithms with large datasets. The main purpose was to assess and analyze the risk factors related to heart diseases through statistical analysis for the identification of heart disease. This made it easy to compare the performance of the different implemented classifiers, i.e., Decision Tree, Random Forest, Naïve Bayes, Discriminant analysis, and support vector machine. The selected classification models were implemented on two different datasets to show that the approach is practically viable. The paper concludes that all classification algorithms are fairly predictive and can give a nearly correct answer; however, the authors noted that the decision tree outperforms the other classifiers, with the random forest algorithm standing second. Modern research is now able to predict the development of a heart attack or heart disease before its occurrence. A risk-factor-based approach proposed by Kanchan More et al. [21] can predict the risk of developing a heart attack by using smartphone technology. They developed an Android application integrated with a clinical database constructed from the data of more than five hundred patients admitted to a cardiac hospital, including the final diagnosis. The presence of ischemic heart disease (IHD) was correlated with the available data, and various risk factors such as diabetes, hypertension, smoking, dyslipidemia, family history, stress, obesity, and existing clinical symptoms were considered. Data mining technology was used to mine the data and generate a nominal score, and three risk classes for IHD, i.e., low, medium, and high, were assigned to the score. The data of 89 participants with acute coronary syndrome (ACS) were used to test the performance of the Android app. While comparing the patients with the data generated by the scores, they found a substantial association with having a cardiac incident when comparing the high and low categories (p = 0.0001): 89% of patients in the high category had IHD, while only 12.5% of patients in the low category were found to have IHD. Similarly, a significant difference between the high and medium categories was found (p = 0.0001). It was also observed that 86.7% of patients with ACS had high scores.

We have another example of a software application that successfully predicts the occurrence of a cardiac attack. The application was developed by Ammar Aldallal et al. [22] and allows medical practitioners to predict the occurrence of non-communicable diseases (NCDs). The proposed software was examined using patients’ records gathered from the Bahrain Defense Force Hospital, and practitioners were asked to execute and test the application in the mentioned hospital. The research claims that the proposed prediction model can instantly predict NCDs with a satisfactory level of effectiveness and efficiency and can help medical practitioners make proper decisions about patient health risks. In addition to software-based approaches, some researchers also focused on developing sensor-based, hardware-oriented approaches to analyze the factors involved in the development of heart disease. A sensor-based approach of this kind is proposed by Johanna O’Donnell et al. [23], who suggested that data collected from multi-sensor patches can be used for analyzing heart failure. These data were collected by multi-sensor patches worn by patients suffering from heart failure. The authors claim that these data can help in initially analyzing the sleep patterns and heart activities of heart failure patients by the severity of heart disease. Chest-worn multi-sensors were provided to thirteen heart failure patients, who were asked to wear the devices for almost seven consecutive days. Multi-sensor data gathered from eleven of those thirteen patients were found to be of high quality and were included in the analysis. Potential differences in heart rate, sleep angle, and wake-time activity were found among heart failure patients with different severities. The authors propose that larger studies are needed to coherently analyze the role of activity and sleep as markers for the occurrence of potential heart failure. Mehmood et al. [24] performed prediction of a potential heart attack by using attributes extracted from a dataset taken from the UCI repository. The authors insisted on the importance of attribute extraction techniques for mining information for the prediction and added that various patterns can be derived to predict heart disease earlier by utilizing such techniques. Several Artificial Neural Network (ANN) techniques are explained in this research work. The paper shows that the ANN gives an accuracy of 94.7%, which principal component analysis improved to 97.7%. The utilization of data mining techniques is also reported by Alizadeh-dizaj et al. [25], who investigated the performance of data mining algorithms and predicted the risk of stroke in suspected stroke patients using a decision tree based on the relevant risk factors. The main database contains 1184 records. In the modeling phase, the Classification Tree, Naïve Bayes, Neural Network, SVM, and KNN algorithms were used. The study states that physical inactivity, high cholesterol, cardiovascular disease, history of transient ischemic attack, history of a previous stroke, and high blood pressure were the most effective variables in predicting potential heart disease. This work established a decision tree whose rules can be used as a model for predicting the risk of stroke in patients, with a claimed accuracy of 95.52%.
The research work shows that it is possible to determine the stroke risk for a new sample with specific characteristics by applying these rules. A method investigating the performance of different classification algorithms such as DT, NB, K-NN, and NN on a heart disease dataset was proposed by T. John Peter et al. [26]. They classified the patients’ records and predicted which patients have heart disease. After applying the various classification algorithms, the authors found that the Naïve Bayes classifier gives better accuracy than the other classifiers. For the sake of efficiency, the authors reduced the dimensionality of the data using attribute selection methods.

A k-means clustering algorithm has been used by researchers [27] incorporating a data warehouse for heart diseases. They applied MAFIA (Maximal Frequent Itemset Algorithm) to calculate the overall significance of the most frequently occurring patterns leading to heart attacks. Another neuro-fuzzy algorithm incorporating a layered approach was proposed in [28], which helped predict occurrences of coronary heart disease; the approach was simulated in MATLAB and showed significant results with a very low error rate and higher efficiency. A technique incorporating association rule mining algorithms was introduced in [29], which makes use of transactional clustering datasets with sequence numbers for predicting the likelihood of potential heart disease. The technique was implemented in the C programming language, and its key idea was to use smaller cluster sizes, which helped increase memory usage efficiency. There is a vast variety of research works on predicting potential heart disease; for instance, one research group incorporated advanced machine learning and data mining techniques such as genetic algorithms and fuzzy logic to predict and analyze a potential heart attack [30]. In addition to working on single, standalone machine learning algorithms, some researchers worked on hybrid models as well [31], introducing a technique that combines ANNs with a machine-intelligent hybrid algorithm to predict occurrences of a heart attack. Another research work [32] introduces a prototype that makes use of Naïve Bayes and a Weighted Associative Classifier (WAC), which helps efficiently predict the probability of a patient suffering a heart attack. Another work proposed a solution to the heart attack prediction problem by developing a web-based intelligent system that uses a Naïve Bayes classifier to diagnose heart disease by answering a set of complex queries. A technique based on association rules was presented in [33], incorporating advanced data mining techniques to improve the overall accuracy of heart disease prediction. The research work proposed in [34] incorporates an algorithm with search constraints; the idea was to minimize the total number of association rules and validate the approach used for training and testing the model. A comparative study [35] of different data mining algorithms, such as support vector machines, artificial neural networks, and decision trees, developed a model evaluated on more than 500 cases. A popular dataset with 15 attributes was incorporated in [36], which uses a Naïve Bayes classifier and ANNs for the prediction of heart disease. Additionally, the main focus of some recent research is to predict a CVD or CHD by incorporating decision trees [37], a significant features and ensemble learning (SFEL) model [38], and a sparse autoencoder-based approach [39]. In this regard, a recent survey on techniques for predicting a CVD has also been presented in [40].

3 Methodology

This work primarily focuses on the prediction of heart disease with the help of a well-established dataset and a state-of-the-art machine learning algorithm, the convolutional neural network. We name the proposed framework CardioHelp; it incorporates the well-established dataset available at [41]. The following subsections contain a detailed description of the proposed technique.

3.1 Majority Voting with LASSO Shrinkage

The least absolute shrinkage and selection operator (LASSO) is a regression technique effectively used for regularization and variable selection to improve the prediction accuracy and interpretability of the produced model. The LASSO technique shrinks the data values toward a central point Pc and helps eliminate parameters and select variables. This type of regression is well suited for highly multicollinear models. LASSO regression adds a penalty equal to the absolute value of the coefficient magnitudes; some coefficients eventually become zero and are eliminated from the model, resulting in a model with fewer coefficients due to variable elimination. Due to their quadratic nature, the LASSO solutions work toward a unique goal, i.e., to minimize:

$$ \sum_{j = 1}^{n} \left( y_{j} - \sum_{k = 1}^{q} x_{jk} \gamma_{k} \right)^{2} + \lambda \sum_{k = 1}^{q} \left| \gamma_{k} \right| $$
(1)

As Eq. (1) shows, the result is an interpretation easily understood from the regression model, because a subset of coefficients (denoted by γ) becomes zero after completion of the shrinking process. The equation includes the tuning parameter \(\lambda\), which denotes the amount of shrinkage: no parameters are removed from the model when \(\lambda = 0\), and more coefficients are set to zero and eliminated from the model as \(\lambda\) increases. The variance increases as \(\lambda\) decreases, and the bias increases as \(\lambda\) increases. The importance of a variable, in terms of its contribution to the underlying variation, is interpreted through its γ value; a variable with γ = 0 is considered unimportant and consequently ignored. It should be noted that a lack of balance in the dataset (data imbalance) produces misleading results from LASSO regression, which may lead to the incorrect selection of important variables when LASSO is performed on the whole dataset.
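As a minimal illustration of Eq. (1), the following scikit-learn sketch shows how increasing the shrinkage parameter (called alpha in scikit-learn, corresponding to λ) drives more coefficients exactly to zero; the data are synthetic and the alpha values are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 45))                         # 45 candidate variables
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)   # only two truly informative

X_std = StandardScaler().fit_transform(X)              # LASSO assumes comparable scales

for alpha in (0.01, 0.1, 0.5):                         # alpha plays the role of lambda
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X_std, y).coef_
    print(f"alpha={alpha}: {np.count_nonzero(coef)} nonzero coefficients")
# Larger alpha (lambda) drives more coefficients exactly to zero,
# eliminating the corresponding variables from the model.
```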

The effect of imbalance can be mitigated by adopting a strategy that randomly subsamples the dataset and iterates LASSO several times, with majority voting on the γ values to select the variables that are nonzero in most of the iterations. Let us understand this with the help of a scenario (a sketch of the procedure is given after Eq. (2)). Assume that N randomly subsampled datasets are used to perform LASSO N times, where every subsample contains equal numbers of CHD and non-CHD examples. If we have 45 variables, we obtain γj = [γj1, γj2, γj3, …, γj45] at the jth iteration. We decide on the inclusion of a variable d in further analysis simply by counting the number of iterations in which that variable is nonzero and comparing it with a manually set threshold. Formally,

$$ \begin{aligned} & x\left( \gamma \right) = \left\{ {\begin{array}{*{20}c} 0 &\quad \text{if}\; \gamma = 0 \\ 1 &\quad \text{otherwise} \\ \end{array} } \right. \\ & \left\| \left[ {x\left( { \gamma_{1,d} } \right), x\left( { \gamma_{2,d} } \right), x\left( { \gamma_{3,d} } \right), \ldots , x\left( { \gamma_{N,d} } \right)} \right] \right\|_{1} \ge \frac{m}{\alpha } \Rightarrow d\;{\text{is}}\;{\text{selected}} \\ \end{aligned} $$
(2)
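A minimal sketch of the subsampling-and-voting procedure behind Eq. (2) is given below; the number of runs, subsample size, and voting threshold (standing in for m/α) are assumptions, and a simple random subsample is used here instead of the exactly class-balanced subsamples described above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_majority_vote(X, y, n_runs=50, subsample=0.8, alpha=0.1, threshold=0.6):
    """Run LASSO on random subsamples and keep variables that are nonzero
    in at least `threshold` fraction of the runs (majority voting)."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    votes = np.zeros(d)
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(subsample * n), replace=False)  # class-balanced in the paper
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        votes += (coef != 0)                 # x(gamma): 1 if nonzero, else 0
    return np.where(votes / n_runs >= threshold)[0]

# selected = lasso_majority_vote(X_std, y)  # indices of the retained variables
```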

3.2 Convolutional Neural Networks (CNN)

Prediction of CHD in a patient can be framed as a binary classification task. In supervised learning, neural networks have proven to be effective classifiers under certain conditions [11]. In recent research, neural networks with application-specific settings, such as several hidden layers, have yielded significant improvements in several areas, including image processing, speech processing, and time series prediction [42]. Relatively large datasets have been used for rigorous training and fine-tuning of various deep learning architectures. The working of an artificial neural network consists of transforming the input data over hidden layers and estimating the error at the output layer [43,44,45]. A gradient descent algorithm then uses the error backpropagated from the output layer to iteratively update the layer weights. Several improvements to the gradient descent algorithm have been proposed through various experiments and analyses, including reducing overfitting, scheduling the training process, making the layers nonlinear, visualizing the hidden layers, and other modifications. Despite the notable success of its applications, the working of deep neural networks is still insufficiently understood. It is also found that such networks are easily overfitted owing to the millions of parameters in a deep architecture, and the problem gets worse when the examples are insufficient.

Various algorithms have been proposed to address this issue. Data augmentation [46, 47] is one widely used technique that artificially populates new small datasets based on the existing examples. Although this technique generates relatively better examples, such procedures are not credible when we are specifically talking about biological applications such as clinical datasets. For instance, augmented measurements of a CHD phenotype, such as a platelet count, might not correspond to the possible range of readings for the patient, because the foundations of platelet count readings are fundamentally different from the principles of statistical generation. Small or skewed datasets lead to poor training, which in turn leads to poor and inaccurate classification. A wrong prediction in medical research carries a significantly bigger penalty than in other applications, such as semantic labeling, image synthesis, or chat-bot configuration.

Due to a poor prediction strategy, a patient with CHD may be left untreated, which may in turn lead to the wrong therapeutic medication. Because the accuracy of a prediction model is critical in medical applications, one of the prime objectives of this research work is to improve the accuracy of the classification, i.e., to predict the presence or absence of CHD in a subject more accurately. We propose a shallow CNN to overcome these problems. As depicted in Fig. 1, the convolution layers of the proposed shallow CNN are sandwiched between two fully connected layers.

Fig. 1 Proposed CNN architecture

3.3 Convolutional Neural Network Architecture

As depicted in Fig. 1, the architecture of the proposed convolutional neural network (CNN) is a feedforward network that works in a sequential single-input-single-output fashion. For the binary classification experiments, we assume that patients with CHD present are labeled ‘1’ and others (with CHD absent) are labeled ‘0’. A multi-class classification experiment is also performed and will be discussed later. As mentioned earlier, the number of active CHD attributes (phenotypes) obtained from the majority voting algorithm is 14. Assuming that the number of training examples is N, the input layer indicated in Fig. 1 has dimension \(R^{N \times 14}\).

The 14 variables selected by LASSO majority (LASSO-M) voting, together with a bias, are combined in a fully connected ‘Dense’ layer with 64 neurons. This layer effectively normalizes the various variable types before the nonlinear transformation, which is performed by a rectified linear unit (ReLU). Overfitting is reduced by 15% dropout. A cascaded set of convolution layers follows the fully connected ‘dense’ layer. Two filters with kernel size 4 and stride 2 are used in the first convolutional layer, and no external zero padding is applied in this layer. To find the best average pooling size under all constraints, various experiments with different pooling strategies were rigorously performed in the pooling layer. The output of the fully connected layer block \(\in { }{\mathcal{R}}^{N \times 64}\) is converted into a tensor of dimension \({\mathcal{R}}^{N \times 64 \times 1}\) in the first convolution layer. The converted tensor then undergoes further manipulation, including nonlinear transformation, average pooling, and batch normalization, to generate an output tensor of dimension \({\mathcal{R}}^{N \times 31 \times 2}\).

The last convolutional layer, with no zero padding, contains 4 filters with kernel width 6 and stride 2, generating an output tensor of \({\mathcal{R}}^{N \times 13 \times 4}\). This output tensor is delivered to the average pooling layer, which forwards the pooled output to the next dense layer. The loss function is set to the categorical cross-entropy loss in the ‘softmax’ layer, in which the categorical output can also be observed. In each layer, we initialize the bias with random numbers drawn from a normal distribution with variance \(\frac{1}{2\surd n}\), where n is the number of connections coming into the layer from the previous one. The Adam optimizer is used with \(\partial_{1} = 0.1, \partial_{2} = 0.988,\;{\text{learning}}\;{\text{rate}} = 0.005\), and decay = 0. Experiments with the various hyperparameters of the proposed model were performed to obtain consistent classification accuracy. During the training phase, results are obtained by varying the number of epochs, the number of neurons, the class weights, and the subsampling of the input data; this is done in each dense layer except the last one. Additionally, the number of filters also varies during the training phase.
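A minimal Keras/TensorFlow sketch of the architecture described above is given below; the layer sizes follow the stated tensor dimensions (N × 64, N × 31 × 2, N × 13 × 4), while the exact pooling placement and the choice of framework are assumptions, since they are not fully specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cardiohelp_cnn(n_features=14, n_classes=2):
    """Shallow CNN sketch: dense block, two Conv1D layers, average pooling, softmax."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),       # fully connected 'dense' layer, 64 neurons
        layers.Dropout(0.15),                      # 15% dropout to reduce overfitting
        layers.Reshape((64, 1)),                   # N x 64  ->  N x 64 x 1 tensor
        layers.Conv1D(2, kernel_size=4, strides=2, padding="valid"),  # -> N x 31 x 2
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Conv1D(4, kernel_size=6, strides=2, padding="valid"),  # -> N x 13 x 4
        layers.GlobalAveragePooling1D(),           # average pooling (placement assumed)
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    opt = tf.keras.optimizers.Adam(learning_rate=0.005, beta_1=0.1, beta_2=0.988)
    model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# model = build_cardiohelp_cnn(); model.summary()
```

Calling model.summary() on the returned model confirms that the intermediate shapes match the N × 31 × 2 and N × 13 × 4 dimensions quoted above.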

3.4 Training Schedule

Although dropout layers are used in the proposed CNN model, this training schedule is also used to improve the classification accuracy and further reduce overfitting. The concept of a penalty helps the algorithm recognize its deficiency and consequently improve. Here, the class weight ratio is adjusted as a penalty because of class imbalance; we define it as the ratio of the CHD class to the non-CHD class. For instance, a class weight ratio of 5:1 means that, during the calculation of the error after each epoch and before the backpropagation stage, we penalize the misclassification of a CHD training sample five times more than the misclassification of a non-CHD sample. The model is initially trained with a 1:N ratio for a large number of epochs, after which the weight ratio is gradually increased while the number of epochs is sharply reduced. Let us assume that the actual class weight is one, taken as a factor \(\varphi_{0}\).
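The following Keras sketch illustrates such a schedule with hypothetical stages, assuming a compiled model and training arrays X_train and y_train prepared elsewhere; the specific weight ratios and epoch counts are assumptions, not the values used in this work.

```python
# Hypothetical schedule: start with a balanced weight ratio for many epochs, then
# progressively increase the penalty on misclassified CHD samples while
# reducing the number of epochs per stage.
schedule = [
    {"chd_weight": 1.0, "epochs": 100},
    {"chd_weight": 2.0, "epochs": 40},
    {"chd_weight": 5.0, "epochs": 10},
]

for stage in schedule:
    model.fit(
        X_train, y_train,                                  # assumed to be prepared elsewhere
        epochs=stage["epochs"],
        class_weight={0: 1.0, 1: stage["chd_weight"]},     # class 1 = CHD present
        verbose=0,
    )
```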


Figure 1 is a pictorial view of the training layer, while Fig. 3 depicts the prediction layer of the proposed framework. We describe both layers in detail in the following sub-sections.

The main focus of the training layer is to prepare a trained model that can later be used for predicting heart disease. Training of the learning model is done by incorporating the output labels of the available training data. We use 30% of the rows for training purposes, and the remaining 70% of the records are used for testing and validation. As Fig. 1 shows, the training layer consists of three distinct steps: loading and cleansing, model training, and storage of the trained model for later use. The prediction layer focuses on testing the dataset for the prediction of potential heart disease: the remaining 70% of the dataset (without output labels) is provided to the trained model for the prediction and testing step.
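A minimal sketch of this two-layer workflow is given below; the file names, the 30/70 split call, and the use of the native Keras model-saving format are assumptions made for illustration.

```python
import pandas as pd
from tensorflow import keras
from sklearn.model_selection import train_test_split

# --- Training layer: load and cleanse, train the model, store it for later use ---
data = pd.read_csv("heart_cleansed.csv")                   # hypothetical cleansed dataset
X = data.drop(columns=["target"]).values                   # 'target': 1 = disease, 0 = none
y = data["target"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.30, stratify=y, random_state=0)     # 30% training / 70% testing

model = build_cardiohelp_cnn(n_features=X.shape[1])        # builder from the earlier sketch
model.fit(X_train, keras.utils.to_categorical(y_train), epochs=50, verbose=0)
model.save("cardiohelp_model.keras")                       # stored trained model

# --- Prediction layer: reload the stored model and classify the unseen records ---
trained = keras.models.load_model("cardiohelp_model.keras")
predicted = trained.predict(X_test).argmax(axis=1)
```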

4 Experimental Results and Discussions

4.1 Dataset

In this section, we briefly describe the dataset used in the experimental work of this study. As mentioned in previous sections, we make use of the well-established dataset available at [41] for this purpose. The set of attributes is a subset of a dataset compiled by medical practitioners in African countries. We incorporate only 14 attributes from this dataset to predict the presence of CHD in a subject. Table 1 lists the attributes used by the algorithm, with a short description and the possible range of values wherever applicable.

Table 1 List of incorporated attributes

4.2 Experimental Setup, Results, and Discussions

This section elaborates on the experimental setup and the underlying hardware and software platforms, along with the experimental parameters. We tried to correlate the features found in the dataset to predict the heart health of the patients. Interesting facts found during the experimentation are discussed in the following subsections.

4.2.1 Experimental Setup

The experimental setup is established with the following parameters to observe and explore the performance of the convolutional neural network (CNN) on the standard dataset available at the UCI repository [41].

4.2.1.1 Hardware Requirements

To carry out this research work and analyze the results, we established an experimental environment on a personal computer. The experimental workstation is equipped with an Intel Quad-Core i7 4th-generation processor working at a clock rate of 2.3 GHz, with a 32 KB L1 cache, a 256 KB L2 cache, and a 4 MB L3 cache. Sixteen gigabytes of DDR3 RAM are installed in the workstation, along with a 1 TB SATA hard disk rotating at 7 K RPM.

4.2.1.2 Software Requirements

This work incorporates Microsoft Windows 10 Pro as the base operating system, along with MATLAB version 2015a.

4.2.2 Performance Metrics

One can evaluate the performance of the classifier by using some performance metrics. In machine learning, various criteria are available for evaluating the classifier’s performance. Some of these criteria are explained as follows.

4.2.2.1 Precision

Precision is a measure of exactness for evaluating the performance of a classifier. If the precision is high, there are fewer false positives; a model with lower precision produces more false positives.

$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}. $$
(3)

Here, in Eqs. (3) and (4), TP stands for true positive, whereas FP denotes false positive. The false negatives are denoted by FN in the subsequent subsection.

4.2.2.2 Recall

Recall is a measure of the completeness of the classifier. The higher the recall, the fewer the false negatives; the lower the recall, the more the false negatives. An improvement in recall often results in a decrease in precision.

$$ {\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} $$
(4)
4.2.2.3 F-Score

The combination of precision and recall is called the F-score, which can be calculated by the following formula:

$$ F1\;{\text{Score}} = 2 \times \frac{{{\text{Recall}} \times {\text{Precision}}}}{{{\text{Recall}} + {\text{Precision}}}} $$
(5)

As Eq. (5) shows, the F1 score is the harmonic mean of precision and recall, i.e., twice their product divided by their sum.
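For reference, the following scikit-learn sketch computes Eqs. (3)-(5) on hypothetical labels.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical classifier output

print(confusion_matrix(y_true, y_pred))                 # [[TN, FP], [FN, TP]]
print("Precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))           # 2PR / (P + R)
```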

4.2.3 Data Preparation

As already mentioned in the previous section, datasets must be prepared according to the nature of the experiments before they begin. The data preparation step involves fetching the dataset into the file system through a third-party program, or saving it directly into the file system with the help of the Python library scikit-learn (SKLearn).

A noticeable amount of noise and some inconsistencies were observed in the dataset. The data therefore needed to be cleansed and made consistent, which was done by writing various programs in the Python programming language. Separate programs handle different tasks, such as data cleansing, replacement of anomalies, and calculation of mean values and normalizers, to translate the data into a cleansed feature matrix. These steps were already discussed in the proposed framework (Sect. 3).
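A minimal pandas sketch of such a cleansing pipeline is shown below; the file names, the '?' missing-value marker, and the normalization choice are assumptions about the hypothetical raw file.

```python
import pandas as pd

# Hypothetical raw file with '?' used as the missing-value marker.
raw = pd.read_csv("heart_raw.csv", na_values=["?"])

# Replace anomalies / missing readings in numeric columns with the column mean.
numeric_cols = raw.select_dtypes(include="number").columns
raw[numeric_cols] = raw[numeric_cols].fillna(raw[numeric_cols].mean())

# Normalize each numeric attribute to zero mean and unit variance.
raw[numeric_cols] = (raw[numeric_cols] - raw[numeric_cols].mean()) / raw[numeric_cols].std()

raw.to_csv("heart_cleansed.csv", index=False)   # cleansed feature matrix
```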

4.2.4 Data Classification Using CNN

4.2.4.1 Heart Disease Classification

Once the dataset is represented as a feature matrix, the next step is to split this dataset into different classes. In the first case, we classified the dataset into two classes, i.e., Heart_Disease and No_Heart_Disease. A classifier model is then generated using the CNN to classify the data into the binary classes, and the model is stored on the file system. The computed accuracy for the binary classification is 97%, as depicted in Fig. 2.

Fig. 2 Confusion matrix for binary classification

Table 2 depicts the performance of the proposed model in terms of precision, recall, F1 score, and accuracy; the overall accuracy of the proposed model is 97%. Another classifier was trained for four classes as well, representing the types of heart disease present in the stored dataset. A program written in Python was executed to determine the number of existing types; it searched the data and divided the dataset into four classes, i.e., Type 1 Disease, Type 2 Disease, Type 3 Disease, and Type 4 Disease. After determining the number of classes from the dataset, the job automatically labeled each record with the closest matching class. The trained model was tested for the four classes, and the results are represented with the help of a confusion matrix, as depicted in Fig. 3. The computed accuracy for the four classes is 87%, which is higher than the results presented in [37]. The confusion matrix in Fig. 3 shows that a total of 114 records were classified into the type 1 class, 77 records into the type 2 class, and 22 records into the type 3 class.

Table 2 Performance of the binary classifier
Fig. 3 Confusion matrix for multi-class classification

Fig. 4 Correlation matrix among predictor variables

Additionally, a total of 47 records were classified into the type 4 class. The number of misclassified records is 8 for type 1 disease and 26 for type 2 disease; 6 records were misclassified for type 3 disease, whereas type 4 disease had no misclassifications. The precision, recall, F1-score, and accuracy values for the four classes are shown in Table 3.

Table 3 Performance overview of the model with four classes

4.2.5 Correlation Among Independent Variables

As discussed earlier, variable selection is an important step in preparing our classification model. We started by investigating the correlation among the 14 predictor variables with continuous values. A high correlation (0.58) was found between oldpeak and slope (descriptions of the variables are given in Table 1), as depicted in Fig. 4. It is reported in the literature that the slope of the ST segment has a significantly higher value in patients with CHD, so it can serve as a good predictor or biological marker of the severity of CHD in a patient. A high correlation (0.50) was also found between oldpeak and num (the percentage of narrowing in artery diameter), meaning that an increase in ST depression during exercise indicates that the patient may have a 50% or greater narrowing of the artery diameter, which seems quite logical.

Regarding the narrowing of artery diameters, a positive correlation coefficient (0.52) between thal and num is also observed, showing that these two parameters are positively related; higher thal values increase the probability of more narrowed arteries in the subject. It is also observed that there is a negative correlation (− 0.33) between cp (chest pain) and thalach (maximum heart rate). A positive correlation (0.28) was found between trestbps (resting blood pressure) and age, indicating that the blood pressure of a CHD patient may increase with age. A positive correlation (0.17) between chol (cholesterol level) and restecg (the result of electrocardiography) was found, which means that CHD patients with higher cholesterol values may also have abnormal ECG readings. Because of the significance of the risk factors and their association with correlated variables when predicting CHD, we performed LASSO regression to correctly identify the predictor variables for further analysis.
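A minimal pandas sketch of this correlation analysis is shown below; the input file name is an assumption, while the column names follow the attributes discussed above and listed in Table 1.

```python
import pandas as pd

data = pd.read_csv("heart_cleansed.csv")   # hypothetical cleansed dataset from earlier

# Pearson correlation matrix among the predictor variables (cf. Fig. 4).
corr = data.corr()

# Pairs discussed above, e.g. oldpeak vs. slope, oldpeak vs. num, thal vs. num.
print(corr.loc["oldpeak", "slope"])
print(corr.loc["oldpeak", "num"])
print(corr.loc["thal", "num"])
```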

4.2.6 Comparison with Other Techniques

As the proposed framework suggests, the trained classifier is then used for the prediction of the whole testing dataset.

Table 4 and the corresponding Fig. 5 depict a performance comparison between the proposed framework and other works in terms of accuracy. It can be observed from the figure that the proposed approach outperforms the other mentioned techniques in terms of overall accuracy, followed by the decision tree-based approach proposed in [25].

Table 4 Performance comparison of the proposed model with existing methods
Fig. 5 Performance comparison in terms of accuracy

5 Conclusion and Future Work

Prediction of heart disease at earlier stages may prevent possible deaths due to heart attacks. A good classification algorithm may help the physician predict the presence of cardiovascular disease before its actual occurrence. This research focuses on predicting possible heart disease by incorporating a well-established dataset available at the UCI repository and convolutional neural networks (CNN). This dataset consists of cardiac test parameters as well as general human habits. The results show that the proposed model outperforms the existing techniques referred to in this paper, with an overall accuracy of 97%. In the future, we aim to further enhance this research work by predicting the occurrence of other major diseases, such as cancer and brain-related diseases.