1 Introduction

Data mining helps to discover hidden patterns, as well as descriptive, predictive, and comprehensible models, in all kinds of data [1]. The most common sources for data mining are data warehouses and databases, although other forms such as data streams, spatial data, and network data can also be used [2]. In other words, its purpose is to analyze data from different perspectives and extract knowledge from it. Data mining techniques developed in recent years include generalization, characterization, classification, clustering, association mining, pattern matching, data visualization, and meta-rule guided mining [3]. The knowledge discovered by these approaches can be applied in various sectors, including the healthcare industry. Most of the information available in this sector is unstructured, massive, and high dimensional. Consequently, a good approach offers a remarkable opportunity to help physicians deal with this large amount of data [4]. The main aim of the healthcare sector is to promote, enhance, and protect the population’s health, which must first be quantified. Data mining can identify patterns, relationships, and models that support prediction and decision-making for diagnosis and treatment planning. The discovered models, called predictive models, can be integrated into hospital information systems to reduce decision-making time [5]. Automatic medical diagnosis systems tend to benefit from collected datasets and decision support systems.

This study proposes a method for predicting patients’ diseases from their prescriptions alone, without relying on their medical records. The dataset comes from a survey conducted by the Non-Communicable Diseases Research Center (NCDRC) in the province of Markazi, listing the drugs prescribed for each patient according to their insurance record. Different classification methods are employed to predict the diseases based on the drugs in the prescription. The experiments show that the prediction results are acceptable. The remainder of this paper is organized as follows: Section 2 states the problem. Section 3 provides a review of relevant studies. Section 4 explains the basis of the employed classification method, i.e. stacking. The proposed method is explained in Section 5. Section 6 presents the experimental results and discussions. Finally, Section 7 concludes the paper.

2 Problem statement

Data mining techniques are applied in health systems to make scientific use of data and analytics, detecting inadequacies and identifying best practices that help cut unnecessary costs. This requires information on the health status of the population. However, organizations that register clinical diagnoses do not always have this information available. In contrast, pharmacy-based data, which usually covers a large population, is readily available and can be used to identify chronic conditions (CCs) in populations, a measure commonly referred to as the “Burden of Disease” [6]. Although the burden of disease has been estimated in many countries such as the USA [7], it is a difficult task in countries that lack a connected hospital system. For these countries, pharmacy data is a reliable source for estimating the burden of disease. From the medical point of view, factors such as chronic diseases, the results of treatment, current improvement or deterioration, subjective and objective evaluation of the patient, and other valuable data can strongly affect diagnosis and decision-making in patients’ treatment [8]. For these reasons, prescription-based data can play a vital role in helping a society promote community health.

3 Literature survey

In recent years, numerous studies have been conducted on predicting different diseases, disease treatment, and drug discovery in many hospitals. Different data mining techniques have been applied to diagnosis, with differing results. A short survey of data mining techniques and their evaluation in healthcare follows.

Chen et al. applied big data analytics to hospitals’ historical data. In their research, a Patient Treatment Time Prediction (PTTP) algorithm was used to predict the waiting time of each treatment task for a patient. The PTTP model was fitted with a random forest optimization algorithm. Based on each patient’s requested treatment tasks, a Hospital Queuing-Recommendation (HQR) system recommends an efficient and suitable treatment plan with the least waiting time for the patient [9].

Bellaachia and Guven used data mining techniques, namely Naïve Bayes, the C4.5 decision tree, and a back-propagated neural network, to predict the survivability rate of breast cancer patients. The dataset used in this study comes from the SEER Public-Use Data and consists of 151,886 records with all 16 accessible fields from the SEER database. They concluded that the model created by C4.5 performed best, with 93.6% accuracy, compared to the other methods [10].

Palaniappan and Awang proposed a classification method for an Intelligent Heart Disease Prediction System (IHDPS) using data mining techniques, including Naïve Bayes and neural networks. They used medical profile attributes such as age, sex, blood pressure, and blood sugar to predict the likelihood of heart disease among patients. IHDPS is a dependable, user-friendly, Web-based system [11].

Repalli built a model to predict how likely people of different age groups were to be affected by diabetes, based on their activities. The medical data had 50,784 records with 37 variables. The model selected for predicting the binary target variable was a decision tree with a low misclassification rate. The results showed that 34% of the population under the age of 20, 33.9% of those aged between 20 and 45, and 26.8% of those aged over 45 were not affected by diabetes [12].

4 Algorithm: Stacking

The increased accuracy of an ensemble, which results from a reduction in the model’s variance and bias, is based on the simple but powerful process of group averaging or majority voting [13]. In other words, an ensemble is a collective decision-making system that combines the predictions of learned classifiers to predict new instances. Stacking is an ensemble method, like Boosting and Bagging (Bootstrap aggregating). Boosting is a machine-learning ensemble algorithm used to reduce variance and bias by turning a set of weak learners into a strong learner. Robert Schapire’s answer to the “hypothesis boosting problem” [14] made significant contributions to machine learning and statistics, most notably the development of boosting, which eventually led to the formulation of AdaBoost (Adaptive Boosting). Bagging, on the other hand, was designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It not only reduces variance but also helps to avoid over-fitting.

There are two approaches for combining models. In voting, the class predicted by the majority of the models is selected. In stacking, the predictions of the individual models are given as input to a meta-level classifier, whose output is the final class.
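As an illustration of the voting approach, the following minimal sketch selects the class predicted by most models. The function name `majority_vote` and the example predictions are hypothetical and are not taken from the experiments in this paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most models.

    Ties go to the label that appears first in `predictions`.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Hypothetical outputs of three base classifiers for one patient.
base_predictions = ["G", "NG", "G"]
final_label = majority_vote(base_predictions)
print(final_label)
```

In stacking, this fixed voting rule is replaced by a trained meta-level classifier.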

Stacking, sometimes called stacked generalization, is an approach that combines several machine learning techniques into one predictive model in order to improve predictive power. It generates a global classifier by training a meta-level learner to combine the predictions of the base-level classifiers [15]. First, the available data is used to train the base-level algorithms; then a combiner algorithm is trained to make the final prediction, using the predictions of the base-level algorithms as additional inputs. Stacking has also led to great results on both supervised learning tasks, such as regression, classification, and distance learning, and unsupervised learning, such as neural networks and density estimation. Table 1 compares the bagging, boosting, and stacking techniques.

Table 1 Comparison of three ensemble methods
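The two-stage stacking procedure described in this section can be sketched as follows. This is a toy illustration under stated assumptions, not the model used in this paper: the base learners (a 1-nearest-neighbour rule and a one-feature decision stump), the table-based meta-level learner, the held-out split, and all data values are hypothetical choices made to keep the example self-contained.

```python
from collections import Counter, defaultdict

def fit_1nn(X, y):
    """Base learner 1: 1-nearest-neighbour with Hamming distance."""
    def predict(x):
        dists = [sum(a != b for a, b in zip(x, row)) for row in X]
        return y[dists.index(min(dists))]
    return predict

def fit_stump(X, y, feature=0):
    """Base learner 2: majority label for each value of one feature."""
    votes = defaultdict(Counter)
    for row, label in zip(X, y):
        votes[row[feature]][label] += 1
    default = Counter(y).most_common(1)[0][0]
    def predict(x):
        seen = votes.get(x[feature])
        return seen.most_common(1)[0][0] if seen else default
    return predict

def fit_meta(base_preds, y):
    """Meta-level learner: majority true label per tuple of base predictions."""
    table = defaultdict(Counter)
    for preds, label in zip(base_preds, y):
        table[tuple(preds)][label] += 1
    def predict(preds):
        seen = table.get(tuple(preds))
        # Unseen combination: fall back to the first base learner.
        return seen.most_common(1)[0][0] if seen else preds[0]
    return predict

# Hypothetical binary drug indicators and disease labels.
X_base = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y_base = ["C", "C", "S", "S"]
X_meta = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]  # held-out part
y_meta = ["C", "S", "C", "S"]

# Stage 1: train the base-level learners.
base_models = [fit_1nn(X_base, y_base), fit_stump(X_base, y_base)]
# Stage 2: train the combiner on the base-level predictions.
meta_inputs = [[m(x) for m in base_models] for x in X_meta]
meta_model = fit_meta(meta_inputs, y_meta)

def stacked_predict(x):
    return meta_model([m(x) for m in base_models])

print(stacked_predict([0, 1, 0]))
```

On this toy data the 1-nearest-neighbour rule alone labels `[0, 1, 0]` as "C", while the stacked model returns "S", because the meta-level learner has learned which combination of base predictions corresponds to which class.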

5 The proposed method

This section first explains how the data were collected and then presents a suitable model for disease prediction.

5.1 Data collection

The dataset consists of drug information for 600 patients, provided by specialists in the medical field. Each instance has 388 attributes: sex, age, and the names of 386 drugs that were prescribed. Only the drug names were selected, and each was placed in its own column as a separate attribute. A value of “1” was then assigned to an instance if the patient had bought that drug and “0” otherwise.
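The binary encoding of the drug attributes can be sketched as follows. The drug names and prescriptions are hypothetical placeholders, and `encode_prescriptions` is an illustrative helper, not part of the tool chain used in this study.

```python
def encode_prescriptions(prescriptions, drug_names):
    """Return one 0/1 row per patient with one column per drug name."""
    return [[1 if drug in bought else 0 for drug in drug_names]
            for bought in prescriptions]

# Hypothetical drug columns (the real dataset has 386 of them).
drug_names = ["amoxicillin", "ibuprofen", "omeprazole"]
# Hypothetical prescriptions: the set of drugs each patient bought.
prescriptions = [{"amoxicillin", "ibuprofen"}, {"omeprazole"}, set()]

rows = encode_prescriptions(prescriptions, drug_names)
for row in rows:
    print(row)
```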

The first purpose of this research is to predict what kind of physician each patient consulted. The class attribute originally consisted of 34 different physicians. Since the frequency of each individual physician was too low to be significant, the labels were merged, based on statistical analysis, into two classes: “general practitioner (G)” and “other specialists (NG)”. Table 2 shows a brief summary of the attributes in the dataset.

Table 2 Attributes description of the dataset

The second and more important goal of this research is to predict what kind of disease each patient is suffering from. A group of three pharmacy students and two professors was assembled to infer each patient’s disease solely from the names of the drugs given to that patient in the dataset. Their labels were considered free of errors, since each professor checked the results twice. The number of classes (distinct diseases) is 70, so each instance in the dataset carries one of 70 disease labels. According to the class frequencies, the first nine classes in descending order are as follows, with the abbreviation of each disease shown in parentheses: Common Cold (C), Vaginal infection (VI), Poisoning (P), Stomach disorders (S), Depression (D), Gastrointestinal distress (GI), Allergy (A), Infection (IN), and Other diseases (O).

It is also worth mentioning that the number of patients suffering from common cold was more than about twice that of each other group; consequently, the dataset was imbalanced. In Table 2, which describes the attributes used in the dataset, Label 1 is the type of doctor (general practitioner or specialist) and Label 2 is the name of the disease.
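One plausible way to arrive at a small number of disease classes is to keep the most frequent labels and map all remaining ones to “O”. The sketch below assumes that this is how the “Other diseases” class was formed, which the text does not state explicitly; the helper `group_rare_labels` and the label values are hypothetical.

```python
from collections import Counter

def group_rare_labels(labels, keep=8, other="O"):
    """Keep the `keep` most frequent labels and map the rest to `other`."""
    counts = Counter(labels)
    frequent = {label for label, _ in counts.most_common(keep)}
    return [label if label in frequent else other for label in labels]

# Hypothetical labels: "X1" and "X2" stand for rare diseases.
labels = ["C", "C", "C", "VI", "VI", "P", "P", "X1", "X2"]
grouped = group_rare_labels(labels, keep=3)

print(Counter(labels).most_common(1))  # the dominant class shows the imbalance
print(grouped)
```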

5.2 Modeling

The main methods of data mining fall into two classes: predictive and descriptive. In this paper, predictive methods are used to predict the two labels explained in the previous part. Classification, regression, and deviation detection are three learning methods of a predictive nature in data mining [16]. This section describes the stacking model, which was designed to reach the highest accuracy. As shown in Table 2, the number of attributes is large relative to the number of instances. As a result, there were too many pairwise correlations between the variables to consider, and the accuracy of the prediction model would suffer if they were all included. Overfitting can occur as well. Ensemble methods, one of the most common remedies, are used to prevent overfitting. In addition, to escape the curse of dimensionality, Principal Component Analysis (PCA) was applied to the input attributes. PCA is a dimension reduction method that uses orthogonal linear projections to reduce the number of parameters. In other words, a set of correlated variables is transformed into a new set of uncorrelated variables. In general, a vector space transform is used to reduce the dimensionality of large data sets [17].
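The idea behind PCA can be illustrated with a minimal sketch that finds only the first principal component by power iteration on the covariance matrix. This is a simplification for illustration: the experiments in this paper keep up to 15 components and rely on the PCA implementation of the data mining tool, and the toy data below is hypothetical.

```python
import math

def center(X):
    """Subtract each column's mean so every attribute has mean zero."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(C, iters=100):
    """Power iteration: unit eigenvector of C with the largest eigenvalue.

    Assumes C is non-zero and that the start vector is not orthogonal
    to the dominant direction.
    """
    d = len(C)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data (hypothetical): two perfectly correlated attributes.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
Xc = center(X)
pc1 = first_component(covariance(Xc))

# Project every instance onto the first component: 2 attributes become 1.
scores = [sum(a * b for a, b in zip(row, pc1)) for row in Xc]
print(pc1, scores)
```

Because the two toy attributes carry the same information, a single component along the diagonal direction preserves all of the variance.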

5.2.1 Model 1

A Stacking algorithm is proposed to model the dataset. It has three base learners, kNN [16], Decision Tree, and Support Vector Machine (SVM), with Naïve Bayes (NB) as the Meta learner. Fig. 1 shows the stacking model for predicting the type of physician (general practitioner or other specialist). In the proposed stacking model, the Meta features, which are the predictions of the three base classifiers, are added to the original features of the instances. Consequently, the Meta learner models the instances with 388 + 3 features. After applying PCA, the best result was obtained with a maximum of 15 components.

Fig. 1 Stacking model 1
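The way Meta features extend the original attributes in Model 1 can be sketched as follows. The helper `add_meta_features` and the predictions are hypothetical; the sketch only illustrates how 388 original features and 3 base-learner predictions yield 391 inputs for the Meta learner.

```python
def add_meta_features(features, base_predictions):
    """Append each base learner's prediction to the original feature vector."""
    return [row + list(preds)
            for row, preds in zip(features, base_predictions)]

# Two hypothetical patients with 388 original attributes each.
features = [[0] * 388, [1] + [0] * 387]
# Hypothetical predictions of kNN, Decision Tree, and SVM for each patient.
base_predictions = [("G", "G", "NG"), ("NG", "NG", "NG")]

augmented = add_meta_features(features, base_predictions)
print(len(augmented[0]))
```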

5.2.2 Model 2

The stacking model for predicting Label 2 of this dataset includes four base learners, kNN [18], Decision Tree, Generalized Linear Model, and Random Forest (RF), with Naïve Bayes as the Meta learner. In this case, the Meta learner models the instances with 388 + 4 features. PCA was also applied to this version of the dataset to reduce its dimensionality, and the maximum number of components was 13. Fig. 2 shows the stacking model for classifying diseases based on the patients’ prescriptions.

Fig. 2 Stacking model 2

6 Experiments

The dataset consists of 600 records. Two attributes were considered as labels, and separate experiments were conducted on each of them. The first is the type of physician each patient consulted, and the second is the disease the patient suffers from. The data mining tool RapidMiner was used for the experiments. The parameters of the base learners and their values are shown in Tables 3 and 4.

Table 3 Accuracy of the stacking model in predicting Label 1
Table 4 Accuracy of the stacking model in predicting Label 2

Table 5 shows that the accuracy of other ensemble methods, with 15 and 13 attributes for the first and second labels of the dataset respectively, is lower than that of the stacking method. The base learner for both the AdaBoost and Bagging methods is a Decision Tree, whose parameters are also shown in Table 5.

Table 5 Comparison of other ensemble techniques

However, the results of the experiments would have been worse if PCA had not been used. In fact, the use of PCA has significantly improved the results. Table 6 shows the accuracies for two stacking models without applying PCA.

Table 6 Accuracies without applying PCA

Tables 7 and 8 compare the accuracy of both stacking models for the combinations of base learners and Meta learners that led to the three best results.

Table 7 Evaluation of best classifiers for label 1 of the dataset
Table 8 Evaluation of best classifiers for label 2 of the dataset

The confusion matrices of the models for predicting Label 1 and Label 2 are shown in Table 9 and Table 10 respectively. For predicting Label 1 (G or NG), there are two classes, and therefore a 2 × 2 confusion matrix is obtained, whereas for the second version of the dataset there are nine classes and a 9 × 9 confusion matrix.

Table 9 A confusion matrix for label 1 of the dataset
Table 10 A confusion matrix for label 2 of the dataset (9 classes)
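How a confusion matrix, the accuracy, and the class recall relate to each other can be sketched as follows. The labels and predictions are hypothetical and do not reproduce Tables 9 and 10. The orientation (rows for true classes, columns for predicted classes) is an assumption of this sketch and may differ from the layout produced by the data mining tool.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count instances; rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of instances on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def class_recall(matrix, i):
    """Share of the true instances of class i that were predicted as class i."""
    row_total = sum(matrix[i])
    return matrix[i][i] / row_total if row_total else 0.0

# Hypothetical Label 1 results for five patients.
labels = ["G", "NG"]
y_true = ["G", "G", "G", "NG", "NG"]
y_pred = ["G", "G", "NG", "NG", "G"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix, accuracy(matrix), class_recall(matrix, 1))
```

With k classes the same function yields a k × k matrix, so nine disease classes give a 9 × 9 matrix.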

According to Table 10, instances of four classes, “D”, “GI”, “A”, and “IN”, are mistaken for class “O”. As can be seen, the final model failed to predict class “D” (Depression), whose class recall is zero, although this disease was sufficiently represented in the original dataset. Merging class “D” into class “O” leaves 8 classes, and this change brings the accuracy of the model to 61.83%. Table 11 shows the confusion matrix for this experiment.

Table 11 A confusion matrix for label 2 of the dataset (8 classes)
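The effect of merging class “D” into class “O” can be illustrated with a small sketch. The labels below are hypothetical toy values chosen to show why the accuracy can rise after merging; they do not reproduce the reported 61.83%.

```python
def merge_class(labels, source="D", target="O"):
    """Relabel every instance of `source` as `target`."""
    return [target if label == source else label for label in labels]

def accuracy(y_true, y_pred):
    """Share of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical case: every "D" instance is predicted as "O".
y_true = ["D", "O", "C", "D"]
y_pred = ["O", "O", "C", "O"]

before = accuracy(y_true, y_pred)
after = accuracy(merge_class(y_true), merge_class(y_pred))
print(before, after)
```

Predictions that were counted as errors only because “D” was confused with “O” become correct once the two classes are treated as one.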

Table 12 shows a short summary of different accuracies for different data mining techniques, which were discussed above.

Table 12 Comparison of data mining techniques

Finally, Tables 13 and 14 demonstrate that the proposed ensemble method has a higher accuracy than single classifiers such as kNN, Naïve Bayes, and decision tree for the first and second labels of the dataset. As discussed in Section 4, a collective decision-making system offers the opportunity to reach a better result.

Table 13 Comparison of single classifiers in predicting label 1
Table 14 Comparison of single classifiers in predicting label 2

7 Conclusion

The overall objective of this paper was to predict what kind of physician each patient consulted and which disease, among nine classes, each patient is suffering from. Three data mining classifiers were used first: Decision Tree, Naïve Bayes, and kNN. However, none of them performed adequately. Therefore, a Stacking classifier was applied, which was shown to have higher accuracy than the single classifiers. For the first version of the dataset, three base learners, kNN, Decision Tree, and SVM, were used for classification, while for the second version four diverse base learners, kNN, Decision Tree, Generalized Linear Model, and Random Forest, were used in the stacking operator. This system can be developed further to use a larger number of drugs as input attributes. Predicting patients’ diseases is the first step. The health care industry can use this information to identify persons with chronic conditions, which is an indirect method of estimating prevalence. Pharmacy drug data may be a valuable source for measuring a population’s burden of disease when clinical data are missing. In addition, other data mining techniques such as Association Rules can be used. Association rules, however, belong to the category of descriptive methods, so only descriptive rules can be obtained from the dataset.