1 Introduction

Data science has recently attracted considerable attention in academia and research circles. Cloud computing and big data have made a data science approach practical. Applied as big data analytics in the healthcare domain, this approach has the potential to make a large impact on the stakeholders of the industry. It can also reduce the cost of healthcare services, enhance Quality of Service (QoS) and reduce error and waste [5]. Several contributions have been made in this regard. Iqbal et al. [1] presented a data science approach for analyzing cyber-physical systems that uses computational intelligence and data analytics to improve security. They used fuzzy logic in their method to enhance intelligence. Mehta and Pundit [5] explored the concurrence between healthcare requirements and big data analytics. They argue for a shift in the culture of healthcare units through technology-driven approaches. Ngiam and Khor [7] explored different machine learning (ML) algorithms for big data in the healthcare industry. They found that such algorithms can help in disease diagnosis and in different interventions related to healthcare units.

Sahoo et al. [9] studied an “intelligence-based health recommendation system using big data analytics”. They found that every recommender system has phases such as data collection, learning, recommendation and feedback. They reported different kinds of recommender systems based on content-based, model-based and hybrid filtering techniques. Dlamini et al. [20] used heart disease and oncology case studies to examine the benefits of using big data. The literature shows that big data analytics is already in use in the healthcare domain. However, it treats prediction systems and recommendation systems separately. A system that both predicts disease and provides recommendations is more useful. Moreover, a novel feature selection method can further improve performance. This is the basis for the work in this paper. Our contributions are as follows.

We propose the Disease Prediction and Drug Recommendation Framework (DPDRF). This framework is based on supervised machine learning and has two phases, namely training and testing. In the training phase, which can be done offline, a number of ML models learn from labelled data. This learning process provides the knowledge required for automatic disease prediction and drug recommendation. The trained models are persisted so that they can be reused later without repeating the training, which leads to faster disease detection. In the testing phase, which is considered online, unlabelled data is classified.

We propose an algorithm known as Cardio Disease Prediction and Drug Recommendation (CDP-DR) to realize the framework. This algorithm takes a dataset and an ML pipeline as inputs. After pre-processing, the data is split into training and test sets. The training data is then subjected to feature selection, which plays a crucial role in identifying features with higher importance and thereby improves the quality of training. Feature selection is done by invoking the EG-HFS algorithm proposed in this paper. Afterwards, an iterative process trains each ML model in the pipeline on the training data and performs classification on the test data.

We define an algorithm known as Entropy and Gain-based Hybrid Feature Selection (EG-HFS) to improve the quality of training and thereby the performance of the prediction models. This algorithm is a filter-based approach that considers the correlation of features with the target class. It is a hybrid approach in which entropy-based and gain-based measures are combined to identify contributing features.

The remaining sections of the paper are structured as follows. Section 2 reviews related work on big data analytics and its usage in the healthcare domain. Section 3 presents the Disease Prediction and Drug Recommendation Framework (DPDRF). Section 4 presents results of CDP-DR compared with the state of the art. Section 5 concludes the paper and outlines the future scope of the work.

2 Related work

This section reviews the literature on data analytics and prediction for real-world applications, particularly in the healthcare domain.

2.1 Data science approaches

Iqbal et al. [1] presented a data science approach for analyzing cyber-physical systems that uses computational intelligence and data analytics to improve security. They used fuzzy logic in their method to enhance intelligence and intend to apply it to different application areas in future. Iqbal et al. [2] carried out similar research to [1] but extended it to different domains. Galesti et al. [11] explored data science in terms of resources, data types, analysis techniques and the potential benefits of using data science in healthcare service units. Pramanik et al. [12] investigated the privacy issues to be considered while using data science analytics. Bag et al. [13] focused on the role of different factors in the adoption of data science in the manufacturing industry. These factors are associated with resources, the economy and institutional pressures. Banerjee et al. [14] explored trends in the Internet of Things (IoT) and associated data science for healthcare and biomedical technologies. They found that data science can add more value to healthcare organizations. Mikalef et al. [15] used a hybrid method to ascertain the relationship between data science based analytics and the performance of a given organization. Su et al. [16] used a data science approach to improve the representation of carbon emissions. Patel et al. [17] found that data science has the potential to improve sports systems. Ma et al. [18] applied a data science approach to tourism to obtain interesting facts. Zhang et al. [19] focused on cognitive data science to ascertain negativity in public emergencies. Dlamini et al. [20] used heart disease and oncology case studies to examine the benefits of using data science.

2.2 Big data analytics approach

Anisetti et al. [3] proposed a big data based methodology as a service for policies associated with public health in urban areas. It could improve the public health policy-making process. Palani Samy and Thirunavukarasu [4] explored the implications of having frameworks pertaining to the healthcare domain. The stakeholders who benefit from such frameworks include patients, medical practitioners, hospital operators, pharma and clinical researchers and healthcare insurance providers. Mehta and Pundit [5] explored the concurrence between healthcare requirements and big data analytics. They argue for a shift in the culture of healthcare units through technology-driven approaches. Wang et al. [6] investigated the potential benefits of big data analytics and its capabilities pertaining to the healthcare domain. They found that big data analytics is capable of leveraging business intelligence (BI) and using modern computational infrastructure. Their architecture has different layers, namely a data layer, a data aggregation layer, an analytics layer and a knowledge exploration layer. These layers sit on top of a data governance layer. Ngiam and Khor [7] explored different machine learning (ML) algorithms for big data in the healthcare industry. They found that such algorithms can help in disease diagnosis and in different interventions related to healthcare units. Galesti et al. [8] studied the big data approach to healthcare and identified its value to organizations as well as the challenges it poses for society and organizations. Sahoo et al. [9] studied an “intelligence-based health recommendation system using big data analytics”. They found that every recommender system has phases such as data collection, learning, recommendation and feedback. They reported different kinds of recommender systems based on content-based, model-based and hybrid filtering techniques. Email et al. [10] focused on smart big data analytics that involves clustering in traditional ML and also clustering in distributed architectures.

2.3 Recent methods

Akkem et al. [25] explored AI based methods and their significance in solving problems in different application domains. Rahim et al. [26] proposed an integrated approach for cardiovascular prediction. Their methodology was based on ML techniques. It has provision for learning from historical data and classifying diseases in newly given test data. However, their methodology lacks a feature engineering approach. Bertsimas et al. [27] also studied the utility of ML models in heart disease prediction. Their approach uses the best performing ML models for the detection of heart diseases. The research in [28] not only includes heart disease prediction but also has provision for severity identification. Ali et al. [29] used many ML models for cardiovascular disease diagnosis, while Das et al. [30] used ML models to investigate their diagnostic potential. The literature shows that big data analytics is already in use in the healthcare domain. However, it treats prediction systems and recommendation systems separately. A system that both predicts disease and provides recommendations is more useful. Moreover, a novel feature selection method can further improve performance. This is the basis for the work in this paper.

3 Proposed framework

This section presents the details of the proposed framework, the underlying algorithms and the evaluation methodology. A data science approach to disease prediction and drug recommendation in healthcare is followed, with cardio disease as a case study. The framework describes the supervised learning methodology used to identify the presence of disease. Based on the disease identification, drug recommendations are generated. The healthcare domain, and cardio disease detection and drug recommendation in particular, is chosen because the healthcare industry is crucial for human wellbeing.

3.1 Disease prediction and drug recommendation framework

Machine learning models have long been used for prediction and classification in different kinds of problems. Machine learning algorithms learn from training samples and produce the desired business intelligence. Supervised learning has two phases, a training phase and a testing phase. In the training phase, the machine learning model is trained on labelled training samples. In the testing phase, unlabelled samples are used as the test set and the algorithm predicts their class labels. In unsupervised learning, the algorithm is given no explicit training. Semi-supervised methods, on the other hand, combine supervised and unsupervised learning. The proposed framework in this paper is based on supervised learning. However, the performance of supervised learning depends on the quality of the training data. If the training data is of poor quality, the performance of supervised learning models deteriorates. To overcome this problem, we propose a feature selection algorithm that identifies the features that contribute to class label prediction.

The framework is realized by defining an algorithm known as Cardio Disease Prediction and Drug Recommendation (CDP-DR). This algorithm in turn uses different supervised machine learning (ML) algorithms: Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), Stochastic Gradient Descent (SGD), Gradient Boosting (GB) and Extreme Gradient Boosting (XGB). DT generates understandable rules in the form of a tree for disease classification. RF is an ensemble model that uses multiple DTs and a voting approach to determine the class label. The LR model is based on the probability of a given instance belonging to a class. The SGD model optimizes its internal parameters by stochastic gradient descent. GB uses an ensemble approach to determine class labels. XGB uses gradient boosted decision trees to improve classification performance. Another algorithm, known as Entropy and Gain-based Hybrid Feature Selection (EG-HFS), is defined to improve the quality of training and thereby the performance of the prediction models. An overview of the framework is shown in Fig. 1.

Fig. 1

Overview of the proposed data analytics framework

The framework makes use of a training set in which one attribute holds the class label. In other words, the ground truth (diagnosis) is known beforehand. Every instance (the record of a patient) in the training set has different attributes and a diagnosis column. This training set is used to train a classifier. Before training, however, the hybrid feature selection algorithm is used to find the features that contribute to the prediction of class labels. Only these features are used for training the classifier. This improves the quality of training because feature selection removes irrelevant and redundant features. Once training is completed, a knowledge model is created for use in the testing phase. The classifier uses this knowledge model to predict class labels when test data is given. In the testing phase, test data without class labels is given as input. It is subjected to hybrid feature selection to identify useful features. These features are then given to the knowledge model (prediction model) that resulted from the training phase, which predicts the class labels. In essence, the given test samples are classified into cardio disease and no cardio disease.

The input dataset [21] is divided into a training set and a testing set. The training set is subjected to a hybrid feature selection method known as Entropy and Gain based Hybrid Feature Selection (EG-HFS), which takes all extracted features as input and returns the features that are highly relevant. Different classifiers are used to build prediction models. In the training phase, the extracted features are learned by the proposed algorithm known as Cardio Disease Prediction and Drug Recommendation (CDP-DR), which is capable of providing predictions and recommendations. In the testing phase, the extracted features are passed to the resultant model, which assigns the labels. EG-HFS uses the metrics expressed in Eqs. 1 to 4.

$$\text{SU}=\frac{2\cdot \text{Gain}}{H\left(X\right)+H\left(Y\right)}$$
(1)

Eq. 1 defines symmetric uncertainty (SU), a composite metric derived from the entropy and gain metrics. The computation of H(X) is expressed in Eq. 2.

$$H\left(X\right)=-\sum_{x\in X}p\left(x\right)\log p\left(x\right)$$
(2)

In the same fashion, the computation of H(Y) is done as expressed in Eq. 3.

$$H\left(Y\right)=-\sum_{y\in Y}p\left(y\right)\log p\left(y\right)$$
(3)

The gain metric is computed as in Eq. 4.

$$\text{Gain}=H\left(Y\right)-H\left(Y\mid X\right)$$
(4)

When combined, the entropy and gain measures yield symmetric uncertainty, the composite metric used to determine whether a feature contributes to class label prediction. Because two measures are combined, the result is a hybrid metric. Hence, the proposed feature selection method is named Entropy and Gain-based Hybrid Feature Selection (EG-HFS).
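
As a worked illustration of Eqs. 1 to 4, consider a hypothetical binary feature X and class label Y over eight patient records. The counts are invented for illustration, and base-2 logarithms are assumed because the paper does not state the base. Suppose four records have X = 0, all with Y = 0, and four have X = 1, of which two have Y = 1. Here H(Y | X) denotes the conditional entropy of Y given X.

$$H\left(X\right)=-\tfrac{1}{2}\log_2\tfrac{1}{2}-\tfrac{1}{2}\log_2\tfrac{1}{2}=1$$

$$H\left(Y\right)=-\tfrac{1}{4}\log_2\tfrac{1}{4}-\tfrac{3}{4}\log_2\tfrac{3}{4}\approx 0.811$$

$$H\left(Y\mid X\right)=\tfrac{1}{2}\cdot 0+\tfrac{1}{2}\cdot 1=0.5$$

$$\text{Gain}\approx 0.811-0.5=0.311,\qquad \text{SU}\approx \frac{2\cdot 0.311}{1+0.811}\approx 0.344$$

A feature identical to the class label would give SU = 1, and a feature independent of it would give SU = 0, so under these assumptions the example feature is moderately informative.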

The methodology is based on supervised learning, which has two distinct phases. The proposed system pre-processes the given data and splits it into a training set (T1) and a testing set (T2). The system uses a number of ML models for disease diagnosis, and these models are trained on the training data. However, prior to training, the training data is subjected to the proposed feature selection method. This method is a composite filter method that computes the importance of each feature. Based on the feature importance, only contributing features are chosen, and the ML models are trained on the selected features only. This has the potential to improve the quality of training.

The proposed methodology also has provision for feature selection, which Section 3.3 describes in more detail. The feature selection method exploits entropy and gain measures in a filter-based approach. These measures relate each feature to the target class label in order to compute feature importance. As features differ in importance, only the contributing features are selected. This feature selection method is reused by the disease diagnosis algorithm.

3.2 Cardio disease prediction and drug recommendation

An algorithm known as Cardio Disease Prediction and Drug Recommendation (CDP-DR) is proposed. It is evaluated using the dataset collected from [21]. It uses the entropy and gain metrics to better determine the features that are useful in predicting class labels.

Algorithm 1

Cardio disease prediction and drug recommendation algorithm

Algorithm 1 takes a healthcare dataset D and a pipeline of ML models M as inputs and generates disease detection results and recommendations. In the process, it uses EG-HFS to select the best features prior to training the classifiers in the pipeline. One iterative process trains all classifiers and another iterative process performs detection on all test instances. The rationale behind this algorithm is to improve the quality of training in the prediction of cardio disease. Without feature selection, even the best classification algorithms may yield mediocre results. To avoid this problem, the proposed EG-HFS is used as part of the main algorithm.
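
The control flow described for Algorithm 1 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the two toy classifiers stand in for the real ML models, `select_features` is a hypothetical placeholder for EG-HFS that returns the column indices to keep, the split ratio and seed are assumptions, and the drug recommendation step is omitted because the text does not specify it.

```python
import random


class MajorityClassifier:
    """Toy stand-in for a real model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class OneRuleClassifier:
    """Toy stand-in: predicts the majority label seen for each value of the first feature."""

    def fit(self, X, y):
        by_value = {}
        for row, label in zip(X, y):
            by_value.setdefault(row[0], []).append(label)
        self.rule_ = {v: max(set(ls), key=ls.count) for v, ls in by_value.items()}
        self.default_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.rule_.get(row[0], self.default_) for row in X]


def run_pipeline(rows, labels, models, select_features, test_fraction=0.25, seed=0):
    """Split the data, select features on the training part, then train and test each model."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_test = max(1, int(len(order) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    X_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]

    # Feature selection sees the training data only and returns column indices.
    keep = select_features(X_train, y_train)

    def project(X):
        return [[row[j] for j in keep] for row in X]

    accuracy = {}
    for name, model in models.items():  # first loop: train every model in the pipeline
        model.fit(project(X_train), y_train)
        predictions = model.predict(project(X_test))  # second loop: classify test instances
        hits = sum(p == t for p, t in zip(predictions, y_test))
        accuracy[name] = hits / len(y_test)
    return accuracy


# Invented toy data: column 0 equals the label, column 1 is constant.
rows = [[i % 2, 7] for i in range(16)]
labels = [i % 2 for i in range(16)]
models = {"majority": MajorityClassifier(), "one_rule": OneRuleClassifier()}
results = run_pipeline(rows, labels, models, select_features=lambda X, y: [0])
print(results)
```

Passing the feature selector in as a function keeps the pipeline independent of how feature importance is computed, which mirrors the way CDP-DR invokes EG-HFS as a separate step.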

3.3 Entropy and gain based hybrid feature selection

This algorithm is defined to improve the performance of prediction models. The prediction models used in the experiments are Random Forest (RF) [22], Logistic Regression (LR) [23], Decision Tree (DT) [25], Stochastic Gradient Descent (SGD), Gradient Boosting (GB) and Extreme Gradient Boosting (XGB). The algorithm takes the dataset [21] as input and produces the selected features. It acts as a pre-processing step before classification or disease prediction. It is based on the gain and entropy measures, which are combined into a hybrid metric to better determine useful features. Given a dataset containing details of patients, the algorithm finds useful features, and these features are used in the training phase of the proposed framework. With these features, the cardio disease prediction models listed above are expected to perform better in disease prediction.

Algorithm 2

Entropy and gain based Hybrid feature selection

Algorithm 2 takes a healthcare dataset D and a feature importance threshold th as inputs and determines the contributing features. In the process, it computes measures such as entropy and gain as well as the composite metric SU. Finally, only the features whose importance satisfies the threshold are selected, and they are used for further processing in disease prediction.
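
A threshold-based selection of this kind might look like the sketch below, which follows Eqs. 1 to 4 as written. It is an illustration under stated assumptions rather than the authors' Algorithm 2: features are assumed to be discrete (continuous attributes would first need binning), logarithms are base 2, a feature is kept when SU is at least th, and the feature names and toy values are invented.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values (Eqs. 2 and 3)."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """H(Y|X): entropy of y inside each group of equal x, weighted by group size."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    n = len(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def symmetric_uncertainty(x, y):
    """SU = 2 * Gain / (H(X) + H(Y)), with Gain = H(Y) - H(Y|X) (Eqs. 1 and 4)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # assumed convention when both variables are constant
    gain = hy - conditional_entropy(y, x)
    return 2 * gain / (hx + hy)


def select_features(columns, target, th):
    """Return the names of features whose SU with the target is at least th, plus all scores."""
    scores = {name: symmetric_uncertainty(col, target) for name, col in columns.items()}
    selected = [name for name, score in scores.items() if score >= th]
    return selected, scores


# Invented toy data: eight records, binary class label.
target = [0, 0, 0, 0, 1, 1, 0, 0]
columns = {
    "partly_informative": [0, 0, 0, 0, 1, 1, 1, 1],
    "copy_of_target": list(target),
    "uninformative": [0, 1, 0, 1, 0, 1, 0, 1],
}
selected, scores = select_features(columns, target, th=0.3)
print(selected)
print(scores)
```

With this toy data the feature that is independent of the class label scores zero and is dropped, while the other two pass the assumed threshold of 0.3.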

3.4 Dataset

The FAERS (FDA Adverse Event Reporting System) dataset is collected from [21] and used for the experiments. The dataset contains data pertaining to adverse events and medication errors that are reported to the FDA. It contains information that can support safety surveillance programs and recommendation systems for drugs and therapeutic biological products. The dataset adheres to the guidelines of the International Conference on Harmonisation (ICH E2B).

3.5 Evaluation metrics

The confusion matrix as in Fig. 2 is used to derive metrics presented in this section. The confusion matrix provides information that is useful in getting different metrics. This is done by comparing the ground truth values and algorithm predicted or classified results in terms of positive (presence of cardio disease) or negative (absence of cardio disease).

Fig. 2

Confusion matrix

The metrics are derived from four cases: correct predictions (True Positive and True Negative) and incorrect predictions (False Positive and False Negative). A true positive means that the patient has cardio disease and the algorithm predicts it. A true negative means that the patient has no cardio disease and the algorithm predicts none. A false positive means that the patient has no cardio disease but the algorithm predicts cardio disease. Similarly, a false negative means that the patient has cardio disease but the algorithm predicts no cardio disease.

$$\text{Precision (p)} = \frac{TP}{TP+FP}$$
(5)
$$\text{Recall (r)}=\frac{TP}{TP+FN}$$
(6)
$$\text{F1-score}=2*\frac{\left(p* r\right)}{(p+r)}$$
(7)
$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
(8)

Based on these metrics expressed from Eqs. 5 to 8, experimental results of the proposed algorithms are evaluated with certain baseline algorithms. These measures are widely used in evaluation of machine learning algorithms in the field of healthcare and other domains. Each measure has a value ranging from 0 to 1 reflecting 0% and 100% respectively. Each measure reflects higher performance if the value is closer to 1 and low performance if the value is closer to 0.
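
The following sketch shows how Eqs. 5 to 8 could be computed from ground truth and predicted labels. The function names and the toy labels are illustrative, and returning 0.0 when a denominator is zero is an assumed convention that the paper does not specify.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN, treating `positive` as presence of cardio disease."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluation_metrics(tp, fp, tn, fn):
    """Precision, recall, F1-score and accuracy as in Eqs. 5 to 8."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Invented labels: 1 = cardio disease, 0 = no cardio disease.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
scores = evaluation_metrics(*counts)
print(counts, scores)
```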

4 Experimental results

This section presents the results of the experiments, in terms of exploratory results and performance comparison results. The exploratory results mainly reflect the dynamics in the given datasets, while the performance evaluation results show the disease prediction performance of different models. The dataset is taken from the UCI repository. It is a heart disease dataset with 14 attributes, including the target class label. It holds patient details consisting of symptoms useful for cardiovascular disease diagnosis. Before the dataset is used, pre-processing is carried out, which includes finding missing values and, for numeric attributes, replacing them with mean values.
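
The mean-based handling of missing numeric values mentioned above can be sketched as follows. This is a minimal illustration, assuming records are dictionaries and missing values are marked with None; the column names and values are invented and do not come from the dataset.

```python
from statistics import mean


def impute_mean(rows, numeric_columns):
    """Return new rows in which None in the given columns is replaced by the column mean."""
    means = {}
    for col in numeric_columns:
        observed = [row[col] for row in rows if row[col] is not None]
        means[col] = mean(observed)  # raises StatisticsError if a column is entirely missing
    return [
        {key: means[key] if value is None and key in means else value
         for key, value in row.items()}
        for row in rows
    ]


# Invented records with one missing value per column.
records = [
    {"age": 50.0, "chol": 200.0},
    {"age": None, "chol": 240.0},
    {"age": 70.0, "chol": None},
]
cleaned = impute_mean(records, ["age", "chol"])
print(cleaned)
```

The function builds new records instead of modifying the input, so the raw data remains available for the exploratory analysis.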

As presented in Fig. 3, the implemented source code has different components. The EDA component is meant for data exploration, while Pre-Process detects and treats missing values in the dataset. The Feature Selection component realises the proposed feature selection method, while Model Creation takes care of creating ML models. Model Training and Testing learn from the training data and classify the test data, respectively. The Performance Comparison component uses metrics to measure the performance of each ML model.

Fig. 3

An outline of components in source code implementation

4.1 Exploratory results

Different explorations on the data are made prior to disease prediction and drug recommendations. The exploratory results include histograms on weight, age and gender, results of different clusters, word cloud representing diseases in the dataset and drug versus scope information.

Since the weight of a person influences the probability of disease, the weight attribute in the dataset is subjected to exploratory data analysis. As presented in Fig. 4, the weight of a person influences the frequency of disease: in this dataset, people with higher weight show more disease occurrence. The analysis reflects the disease trends with respect to the weight of people.

Fig. 4

Histogram of weight and frequency of disease in patients

Age is an important factor in human life and wellbeing. As age increases, the probability of losing immunity and acquiring certain diseases grows. Figure 3 shows the relation between the age of people and the frequency of diseases. As per the data available, age is one of the factors that indirectly influence the occurrence of diseases. In this dataset, people older than 30 years are found to have a higher disease frequency. The age and disease frequency analysis thus provides some insights into the occurrence of diseases.

The exploratory data analysis with respect to gender gives an important insight: in this dataset, females are more vulnerable to cardio disease than males. Figure 3 shows the frequency of cardio disease for male (1.0) and female (2.0). It also reveals that cardio disease occurs in the male population, but much less frequently.

As shown in Fig. 5, the fields of the dataset are placed on both the horizontal and vertical axes to visualize their correlations. The correlation value is presented for each pair of variables.

Fig. 5

Correlation matrix
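
A pairwise correlation matrix of this kind could be computed as in the sketch below. The paper does not state which correlation coefficient is used, so the Pearson coefficient is an assumption here, and the column names and values are invented for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # A constant column has zero variance and would cause a division by zero here.
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of named columns."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Invented toy columns.
columns = {
    "age": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.0, 6.0, 8.0],
    "score": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, row)
```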

Figure 6 shows the difference between the number of individuals with heart disease, indicated by 1, and those without heart disease, indicated by 0. The data contains more non-cardio disease patients than cardio patients.

Fig. 6

Visualizing the cardio records in terms of 0 and 1

Figure 7 shows the severity of cardio attack with respect to smoking. Patients who do not smoke have a lower chance of cardio attack, whereas smoking may affect the heart and result in heart disease.

Fig. 7

Visualizing the cardio patients with respect to smoke

As shown in Fig. 8, it is useful to visualize the data in side-by-side views. Different subplots are generated to reflect the data dynamics associated with each attribute in the dataset.

Fig. 8

Visualizing subplots for comparing with each features outer layers

Figure 9 shows the ROC curves of the various models.

Fig. 9

Visualizing ROC curve of proposed models

4.2 Results of drug recommendations

For the cardio disease case study, drug recommendation is patient-specific. Once a patient is diagnosed with cardio disease, the drug recommendations are tailored to the patient based on age, weight and gender.

4.2.1 Drug recommendation for cardio disease patient: (Gender: Female; Weight: 78; Age: 58)

Recommended drug for the patient is Acebutolol. Table 1 shows different variants that are the results of recommendations.

Table 1 Drug recommendations for a female patient with weight 78 and age 58

Table 1 presents the drug recommendations and the score of each recommended drug for a patient with Gender: Female; Weight: 78; Age: 58.

4.2.2 Drug recommendation for cardio disease patient: (Gender: Female; Weight: 80; Age: 75)

Recommended drug for the patient is Statins. Table 2 shows different variants that are the results of recommendations.

Table 2 Drug recommendations for a female patient with weight 80 and age 75

4.2.3 Drug recommendation for cardio disease patient: (Gender: Male; Weight: 70; Age: 45)

Recommended drug for the patient is Betaxolol. Table 3 shows different variants that are the results of recommendations.

Table 3 Drug recommendations for a male patient with weight 70 and age 45

Table 3 presents the drug recommendations and the score of each recommended drug for a male patient with age 45 and weight 70.

4.2.4 Drug recommendation for cardio disease patient: (Gender: Female; Weight: 60; Age: 80)

Recommended drug for the patient is Antiplatelet Agents. Table 4 shows different variants that are the results of recommendations.

Table 4 Drug recommendations for a female patient with weight 60 and age 80

Table 4 presents the drug recommendations and the score of each recommended drug for a female patient with age 80 and weight 60.

4.2.5 Drug recommendation for cardio disease patient: (Gender: Female; Weight: 57; Age: 20)

Recommended drug for the patient is Statins. Table 5 shows different variants that are the results of recommendations.

Table 5 Drug recommendations for a female patient with weight 57 and age 20

Table 5 presents the drug recommendations and the score of each recommended drug for a female patient with age 20 and weight 57. The drug recommendation results are based on the disease diagnosis, gender, weight and age of the patient.

4.3 Disease prediction and evaluation

The disease prediction performance of different prediction models, namely Random Forest (RF) [22], Logistic Regression (LR) [23], Stochastic Gradient Descent (SGD) [24], Decision Tree (DT) [25], Gradient Boosting (GB) and Extreme Gradient Boosting (XGB), is evaluated using different measures.

Table 6 Performance measures of various prediction models

As presented in Table 6, the performance of prediction models with the proposed hybrid feature selection is evaluated.

As presented in Fig. 10, the prediction models are shown on the horizontal axis and their performance on different metrics on the vertical axis. The models showed varied performance. In terms of accuracy, the highest performance is shown by RF with EG-HFS at 0.962359, while the lowest is shown by XGB at 0.5523. DT achieved 0.957314, LR 0.668606 and XGB with linear kernel 0.642607. These results indicate that RF with the proposed EG-HFS performed best within the proposed CDP-DR method.

Fig. 10

Performance evaluation of different prediction models with the proposed hybrid feature selection method

The results section covers two aspects. First, it explores the data by examining data distributions and feature correlations. This part provides useful information about the data used for the empirical study, along with visualizations that let the reader see how the data is distributed and how the attributes are correlated. After the exploratory data analysis, the actual results of disease diagnosis and drug recommendation are provided. Disease diagnosis is based on the ML models used in the empirical study. Drug recommendations are based on the gender, age, weight and kind of illness of the given patient.

5 Conclusion and future work

In this paper we propose a Disease Prediction and Drug Recommendation Framework (DPDRF). The framework is realized by defining an algorithm known as Cardio Disease Prediction and Drug Recommendation (CDP-DR). Another algorithm known as Entropy and Gain-based Hybrid Feature Selection (EG-HFS) is defined to improve the quality of training and thereby the performance of the prediction models. The feature selection algorithm improves performance with a hybrid measure known as symmetric uncertainty, which is made up of entropy and gain measures. The experimental results with cardio disease prediction as a case study revealed that the proposed framework is useful in disease prediction and drug recommendation and shows better performance than the state of the art. In terms of accuracy, the highest performance is shown by RF with EG-HFS at 0.962359, while the lowest is shown by XGB at 0.5523. These results indicate that RF with the proposed EG-HFS performed best within the proposed CDP-DR method. In our future work, we intend to improve the framework further to support more useful data analytics on healthcare data of different kinds.