1 Introduction

The human body is composed of several proteins and amino acids, and it is sustained with the help of minerals and vitamins. A deficiency of any of these essential minerals or vitamins causes the body to malfunction or leads to disease. One such deficiency, that of iron, leads to a disease called Anemia, which can be described as a deficiency of hemoglobin caused by a shortage of iron. Anemia is prevalent in developing countries, and more so among women and children than among men in these countries. Globally, it is regarded as one of the critical health problems.

Identifying Anemia at an early stage, so that it can be prevented from deteriorating into advanced stages, is one of the most challenging issues. This is due to the non-availability of real-time data. Very few works in the literature have explored the detection of Anemia [1,2,3,4,5]. Hence, for the experiments we have considered an Indian dataset collected from the Eureka diagnostic center, Lucknow, India [6], and with the help of a Decision Tree model we have generated a set of rules that helps in detecting Anemia at the early stages.

The remaining sections are arranged as follows. Section 2 describes related work. The proposed method is discussed in Sect. 3. Section 4 deals with the data description and the algorithm used. Results and discussion are elaborated in Sect. 5, while Sect. 6 concludes the work.

2 Related Work

Anemia caused by iron deficiency is one of the most critical health problems globally and a serious public health issue [7]. According to the World Health Organization (WHO), an Anemia prevalence of over 40% in a community makes it a public health issue of critical importance [8]. While Anemia in children can be caused by genetic factors or by nutritional deficiencies, such as deficiencies in iron, folate, vitamins A/B12 and copper, iron deficiency is the most important determinant of anemia [9]. Socio-demographic characteristics of mothers and households, such as region, wealth index, water sources, working status and anemia status, along with child features such as age, nutritional status and size at birth, are the most critical features influencing anemia in children aged 6–59 months [1]. According to WHO, Anemia is prevalent in most countries in Africa and South Asia and in some countries in East Asia and the Pacific. While the highest prevalence of anemia is found in Africa, the largest number of children affected by anemia is found in Asia [2]. Machine learning models are increasingly used in the analysis and prediction of diseases in healthcare [10]. Most studies indicate that machine learning techniques such as support vector machines (SVM), Random Forest and artificial neural networks (ANN) have been applied to the classification of different diseases such as Diabetes [11,12,13], Appendicitis [14], and multiple sclerosis [15]. Machine learning techniques to classify anemia in children are still evolving. Along with traditional clinical practices, machine learning techniques can be utilized to predict the risk of anemia in children. Some key research in this direction has been undertaken in [3, 16], which constructed prediction models for anemia status in children.
The prevalence of anemia among adults was studied using complete blood count (CBC) data at a referral hospital in southern Ethiopia, where prevalence and severity were analyzed in relation to age and gender [17]. Social factors such as income, wealth and education can affect health markers such as blood pressure, body mass index (BMI) and waist size [18]. Sow et al. used support vector machines (SVM) and demographic health survey data from Senegal to classify malaria and anemia status accurately [4, 19]. Using feature selection, the number of features in both the anemia and malaria datasets was reduced. Using variable importance in projection (VIP) scores, the relative importance of social determinants for both anemia and malaria prevalence was computed. Finally, machine learning algorithms, namely artificial neural networks (ANN), K nearest neighbors (KNN), Random Forests, Naïve Bayes and support vector machines (SVM), were used for the classification of both anemia and malaria [20]. Lisboa has demonstrated the utility and potential of artificial neural networks (ANN) in health care interventions [5]. Using CBC samples, a study was undertaken to classify anemia using Random Forests, C4.5 (Decision Tree) and Naïve Bayes (NB); the classifiers were compared using mean absolute error (MAE) and classifier accuracy, and the results are tabulated in [21]. Some research has also applied the Naïve Bayes classifier and an entropy classifier for classification [6].

In 2018, Almugren et al. conducted a study on an anemia dataset and investigated how artificial neural networks (ANN), Naïve Bayes (NB), C4.5 and Jrip data mining algorithms can be used to classify instances in the dataset as anemic or normal, that is, a binary classification problem. The performance of these algorithms was benchmarked for comparative analysis, and ANN and Jrip were found to be the best performing algorithms [22]. Jatoi et al. used data mining methods on a complete blood count (CBC) dataset of 400 patients to detect the presence of anemia, and found that the Naïve Bayes (NB) algorithm had 98% accuracy in predicting the presence of the disease [23]. In a study conducted in 2019, Meena et al. used Decision Tree algorithms to classify a dataset representing children for the diagnosis of anemia. They also identified the significant features driving the prevalence of anemia with reference to the practices adopted for infant feeding [24]. Ching Chin Chern et al. used Decision Tree Classifier models to acquire decision rules for classifying the eligibility of Taiwanese citizens as recipients of telehealth services. A physician, a social worker and health care managers were involved to ensure a thorough process, and the J48 algorithm and logistic regression techniques were used to generate the decision trees representing the decision rules [25]. A study by Lakhsmi K.S et al. used association rule mining on medical records to extract decision rules of the type symptom → disease, using well-known association rule mining algorithms such as Apriori and FP-Growth [26]. Song Yan et al. studied how decision trees can be used to generate decision rules for various medical conditions.
In their paper, they constructed a Decision Tree model representing decision rules for the classification and diagnosis of major depressive disorder (MDD) [27]. In a study by Yildiz et al. on a health care dataset obtained from a hospital in Turkey, the authors used four classification algorithms, namely artificial neural networks (ANN), support vector machines (SVM), Naïve Bayes (NB) and ensemble Decision Trees, to classify various types of anemia; in the benchmark, Bagged Decision Trees was the best performing algorithm [28]. Heru Mardiansyah et al. studied the problem of imbalanced datasets and how it can be resolved by using SMOTE techniques to balance the original dataset. In their study, they selected four datasets from the UCI machine learning repository (German credit cards, Wisconsin, Glass and Ecoli) to show the application of SMOTE techniques and the resulting datasets [29].

Kilicarslan et al. have constructed two hybrid models using genetic algorithms and Deep learning algorithms of stacked autoencoder (SAE) and convolutional neural networks (CNN) for the prediction and classification of iron deficiency anemia and benchmarked the performance of these two hybrid models in the classification computation for iron deficiency anemia [30].

Although several machine learning algorithms incorporating various data mining techniques have been proposed for Anemia prediction, none of them has produced a set of rules that would be handy for identifying Anemia at its different stages. Our proposed methodology attempts to cover this gap by proposing an optimal number of rules.

3 Proposed Method

This work provides an understandable set of rules for detecting Anemia at early stages, obtained by optimizing the rules extracted from a Decision Tree model trained on the Anemia dataset. As the classes contain unequal numbers of samples, we adopted the Synthetic Minority Oversampling Technique (SMOTE) to balance the minority classes. The balanced (SMOTE) dataset is also used with the Decision Tree Classifier for extracting rules. The rules thus obtained are further optimized and evaluated for their efficacy and strength in classifying the Anemia dataset.

The complete description of the data is given in Sect. 4 and the complete features of the dataset are shown in Table 1.

Table 1 Anemia dataset description

The block diagram of the proposed model is shown in Fig. 1. The original dataset is partitioned into training and testing parts using stratified sampling, with 70% of the records used for training and 30% for testing. SMOTE is applied only to the training dataset to obtain the balanced (SMOTE) dataset. The Decision Tree is then trained separately on the two training datasets (i.e., the Actual and the Balanced dataset). The complete description of the proposed model is given in the next sub-section.
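The stratified 70/30 hold-out split described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the seed and the toy label counts are assumptions made for the example. Oversampling would then be applied to the training indices only, so that no synthetic record reaches the test set.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) so that each class keeps roughly
    the same proportion in both parts (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the test share from every class separately.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (made up, not the paper's data).
labels = ["Mild"] * 80 + ["Moderate"] * 30 + ["Severe"] * 10
train_idx, test_idx = stratified_split(labels)
```

Splitting before oversampling, as the text describes, keeps the test set made of real records only.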

Fig. 1 Block diagram of the proposed model

3.1 Description of the Proposed Model

The Decision Tree Classifier is trained separately on the Actual training dataset and on the Balanced (SMOTE) dataset. Each trained model is tested with the testing data, and the results are recorded as testing results. The rules are also extracted separately from each Decision Tree model. The Decision Tree Classifier generated 232 rules when trained on the Actual training dataset, but only 26 rules when trained on the SMOTE dataset.

Since the 26 rules obtained from the SMOTE dataset gave significantly better results than those obtained from the Actual dataset, we took these rules forward to evaluate their efficacy and strength and to reduce their number further.

These rules were coded individually and evaluated on the Anemia dataset for their relevance and efficacy in detecting Anemia, and were then sorted in descending order of accuracy within each class. Next, the two top-yielding rules for each class were selected as the reduced rules. These rules were again coded and evaluated on the testing dataset, and the final results were obtained in the form of the defined performance metrics.

The description of the proposed model can be defined using the following algorithm.

figure a
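One possible way to "code" extracted rules, as described above, is to represent each rule as a predicate paired with the class it predicts. The sketch below is only illustrative: the feature names (`HCT`, `RBC`), the thresholds and the rule names are hypothetical and are not the rules of Appendix A or B.

```python
# Hypothetical rules: feature names and thresholds are illustrative only,
# they are NOT the rules extracted in the paper.
RULES = [
    {"name": "R1", "label": "Severe",
     "test": lambda r: r["HCT"] <= 25.0},
    {"name": "R2", "label": "Moderate",
     "test": lambda r: 25.0 < r["HCT"] <= 33.0},
    {"name": "R3", "label": "Mild",
     "test": lambda r: r["HCT"] > 33.0 and r["RBC"] > 3.5},
]


def classify(record, rules, default=None):
    """Return the label of the first rule whose condition holds,
    or `default` when no rule fires."""
    for rule in rules:
        if rule["test"](record):
            return rule["label"]
    return default


example = {"HCT": 22.0, "RBC": 3.0}
predicted = classify(example, RULES)
```

Keeping each rule as a separate object makes it straightforward to score rules one at a time and to drop the weaker ones later.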

4 Dataset Description

The data used for the experiment was collected from September 2020 to December 2020 in the form of CBC test reports from the Eureka diagnostic center, Lucknow, India [31]. The data was collected with ethical clearance from the diagnostic center, and patient consent was obtained.

Anemia can be classified into three types based on severity level:

  • Mild,

  • Moderate,

  • Severe.

The data consists of eleven attributes, as illustrated in Table 1, and the dataset contains 364 records. The class variable is Hemoglobin (HGB), which has three classes, namely Mild, Moderate and Severe. These classes are defined by ranges of HGB values: Mild lies between 11.0 and 12.0, Moderate lies between 8.0 and 11.0, and Severe lies below 8.0. The class distribution in the dataset is 70.32% Mild, 25.28% Moderate and 4.40% Severe.
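The class definition above can be written as a small labelling function. The text gives 8.0 and 11.0 as end points of two adjacent ranges without saying which class owns them, so the sketch assumes that 8.0 counts as Moderate and 11.0 as Mild; the function name `severity_from_hgb` is also an assumption.

```python
def severity_from_hgb(hgb):
    """Map a hemoglobin value to a severity class.

    Boundary handling is an assumption: 8.0 -> Moderate, 11.0 -> Mild.
    Values above 12.0 fall outside the ranges given in the text.
    """
    if hgb < 8.0:
        return "Severe"
    if hgb < 11.0:
        return "Moderate"
    if hgb <= 12.0:
        return "Mild"
    return None


classes = [severity_from_hgb(v) for v in (7.2, 9.5, 11.4)]
```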

As the three classes were not balanced, we adopted the SMOTE balancing technique to balance all the classes of the dataset, so that the machine learning technique used for training would not become biased.
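The core idea of SMOTE is to create a synthetic minority sample on the line segment between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation step in plain Python; it is not the implementation used in this work (which the text does not name), and the function name, `k` and the toy points are assumptions.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Rank the remaining minority points by squared Euclidean distance.
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class points with two made-up features.
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (2.5, 4.0)]
new_points = smote_like_oversample(minority, n_new=10)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region already occupied by that class.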

The snapshot of the dataset can be seen below in Fig. 2.

Fig. 2 Snapshot of the original dataset

4.1 Algorithm Used (Decision Tree Classifier)

The Decision Tree Classifier is a rule-based classifier that splits the data on the basis of impurity measures. It uses criterion functions such as the Gini Index and Information Gain (which is based on entropy) to split the given data into the classes [32]. It can be represented as a hierarchical tree diagram, where the classes appear at the leaf nodes and the splitting features appear at the interior nodes. The algorithm is well suited to decision-making problems in which samples must be assigned to different classes. An example of a decision tree can be seen in Fig. 3.

Fig. 3 Example of a decision tree

The Decision Tree Classifier was implemented in Python using the scikit-learn (sklearn) library. The Gini Index was taken as the criterion function.
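To make the Gini criterion concrete, the sketch below computes the Gini impurity of a set of labels and the size-weighted impurity of a candidate two-way split, which is the quantity a tree tries to minimize when choosing a split. It is a plain-Python illustration under assumed function names and toy labels, not the library's internal code.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a two-way split, weighted by the size of each side."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy example: a split that separates the two classes perfectly.
parent = ["Mild"] * 4 + ["Severe"] * 2
pure_split = weighted_gini(["Mild"] * 4, ["Severe"] * 2)
mixed_split = weighted_gini(["Mild", "Mild", "Severe"],
                            ["Mild", "Mild", "Severe"])
```

A split with lower weighted impurity than the parent node is preferred; a perfectly separating split reaches zero.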

5 Results and Discussion

All the experiments were conducted using the hold-out method. The training data was taken as 70% and the testing data as 30% of the dataset. The splitting was done using stratified random sampling. The results of three models, namely two Decision Trees trained on the Actual and SMOTE datasets and the proposed reduced rule-based method, are presented in Table 2.

Table 2 Results of the decision tree using different datasets and reduced rules-based method

Accuracy, Recall and Precision have been used as performance metrics for measuring the efficacy of the results. These three can be defined as follows:

  • Accuracy can be defined as the ratio of the total number of patients correctly classified, irrespective of class, to the total number of patients.

    $$\text{Accuracy }= \frac{\text{Total number of correctly predicted samples} (\text{TP}+\text{TN})}{\text{Total number of samples} (\text{TP}+\text{TN}+\text{FP}+\text{FN})}$$
    Table 3 Efficacy of the reduced rules
  • Recall can be defined as the ratio of the number of patients correctly classified as a given Anemia class to the total number of patients of that class present in the dataset.

    $$\text{Recall }= \frac{\text{number of correctly predicted samples of a class} (\text{TP})}{\text{Total number of samples in that class} (\text{TP}+\text{FN})}$$
  • Precision can be defined as the ratio of the number of patients correctly classified as a given Anemia class to the total number of patients classified as that class.

    $$\text{Precision }= \frac{\text{number of correctly predicted samples of a class} (\text{TP})}{\text{Total number of samples classified as class} (\text{TP}+\text{FP})}$$
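For a three-class problem such as this one, the metrics defined above are computed per class from the true and predicted labels. The following sketch shows one way to do this in plain Python; the function name and the toy label lists are assumptions, and the numbers are not results from this work.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return overall accuracy and per-class recall and precision."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        metrics[label] = {
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
        }
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, metrics


# Toy labels (made up for illustration).
y_true = ["Mild", "Mild", "Mild", "Moderate", "Moderate", "Severe"]
y_pred = ["Mild", "Mild", "Moderate", "Moderate", "Mild", "Severe"]
accuracy, metrics = per_class_metrics(
    y_true, y_pred, ["Mild", "Moderate", "Severe"]
)
```

In this toy case four of six predictions are correct, and the Mild class has two true positives, one false negative and one false positive.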

The Actual dataset yielded 232 rules, which is a very large number, and it would be difficult for the end user to interpret or use so many rules. Moreover, the results obtained with these rules were not significant, owing to the imbalanced nature of the dataset. Hence, the SMOTE technique was employed to balance the dataset. Using the balanced dataset with the Decision Tree Classifier, we obtained 26 rules, which were more concise than those of the earlier model and also gave improved results. The rules obtained from the balanced dataset are shown in Appendix A. Though these rules were concise, their efficacy was not very promising, and they included some rules of little importance, which would make the rule set less handy for the end user.

So, in order to evaluate the efficacy of the rules, the 26 rules were coded and their efficacy in detecting Anemia was computed by applying them to the Actual dataset. The rules were then sorted in descending order of efficacy within their respective classes. The top two rules from each class were extracted as the reduced rules, whose efficacies are shown in Table 3. The reduced rules are presented in Appendix B.

Efficacy is computed using the following formula,

$$\text{Efficacy} = \frac{\text{number of patients of a given class correctly classified by a specific rule}}{\text{total number of patients in that class}}$$
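The efficacy formula and the selection of the top two rules per class can be sketched as follows. The rules, the feature name `HCT`, the thresholds and the records are all hypothetical and serve only to show the ranking step; they are not the rules of Appendix A or B.

```python
def efficacy(rule, records):
    """Share of the rule's own class that the rule classifies correctly."""
    in_class = [r for r in records if r["label"] == rule["label"]]
    if not in_class:
        return 0.0
    hits = sum(1 for r in in_class if rule["test"](r))
    return hits / len(in_class)


def top_rules_per_class(rules, records, keep=2):
    """Rank rules within each class by efficacy and keep the best ones."""
    selected = {}
    for label in sorted({rule["label"] for rule in rules}):
        ranked = sorted(
            (rule for rule in rules if rule["label"] == label),
            key=lambda rule: efficacy(rule, records),
            reverse=True,
        )
        selected[label] = [rule["name"] for rule in ranked[:keep]]
    return selected


# Hypothetical records and rules (illustrative values only).
records = [
    {"HCT": 36, "label": "Mild"}, {"HCT": 34, "label": "Mild"},
    {"HCT": 31, "label": "Mild"}, {"HCT": 28, "label": "Moderate"},
    {"HCT": 30, "label": "Moderate"}, {"HCT": 20, "label": "Severe"},
]
rules = [
    {"name": "M1", "label": "Mild", "test": lambda r: r["HCT"] > 33},
    {"name": "M2", "label": "Mild", "test": lambda r: r["HCT"] > 30},
    {"name": "M3", "label": "Mild", "test": lambda r: r["HCT"] > 35},
    {"name": "D1", "label": "Moderate", "test": lambda r: 25 < r["HCT"] <= 30},
    {"name": "S1", "label": "Severe", "test": lambda r: r["HCT"] <= 25},
]
reduced = top_rules_per_class(rules, records)
```

Here the three Mild rules have efficacies of 2/3, 1 and 1/3, so the two kept for that class are M2 followed by M1.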

Next, these reduced rules were also coded and their performance metrics like Recall, Precision and Accuracy were also computed on the Actual dataset which are shown in Figs. 4, 5 and 6.

Fig. 4 Recall values for the three classes using decision tree and reduced rules-based method

Fig. 5 Precision values for the three classes using decision tree and reduced rules-based method

Fig. 6 Accuracy values for the three classes using decision tree and reduced rules-based method

It can be observed from Table 2 that the Decision Tree Classifier using the Actual dataset gives very low Accuracy, Recall and Precision values. This is due to the imbalanced nature of the dataset. In the case of the Decision Tree Classifier using the SMOTE dataset, all three metrics improve compared to the results on the Actual dataset. In the case of the proposed Reduced rules-based method, the elimination of the less important rules improves the recall significantly for the Mild and Severe classes. Though there is a slight reduction of 1% in the accuracy of the Reduced rules-based method compared to the Decision Tree Classifier using the SMOTE dataset, the number of rules is reduced by about 77%, which is a promising and significant contribution of this work. A minimal set of rules for the detection of Anemia would be an important tool for the end user.
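As a quick check of the quoted reduction, assume the reduced set holds two rules for each of the three classes, i.e. six rules in total (a count implied by the selection procedure but not stated explicitly in the text), out of the 26 rules obtained from the SMOTE dataset:

$$\text{Reduction} = \frac{26 - 6}{26} = \frac{20}{26} \approx 0.769 \approx 77\%$$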

The notations used in Figs. 4, 5 and 6 are:

  • DT (Actual dataset): Decision Tree using the Actual dataset

  • DT (SMOTE dataset): Decision Tree using the SMOTE dataset

  • Reduced rules method: Reduced rule-based method on Actual dataset

Figure 4 shows the recall values of the three classes for the three methods: the two Decision Trees trained on the Actual and SMOTE datasets, and the proposed Reduced rules-based method. The Reduced rules-based method gives the best recall values for the Mild and Severe classes, 96.48% and 100% respectively. For the Moderate class, the Decision Tree Classifier with the balanced (SMOTE) dataset gives the best recall value of 88%.

Figure 5 shows the precision values of the three methods. For the Decision Tree Classifier using the Actual dataset, the precision values decrease across the classes. For the Decision Tree Classifier using the SMOTE dataset and for the coded Reduced rules method, the precision values differ only slightly across the classes. The highest precision for the Mild class is obtained with the Actual dataset, for the Severe class with the proposed Reduced rules-based method, and for the Moderate class with the Decision Tree Classifier using the SMOTE dataset.

As far as Accuracy is concerned, Fig. 6 shows that the Decision Tree Classifier using the SMOTE dataset achieves higher accuracy than both the Decision Tree Classifier using the Actual dataset and the Reduced rules-based method. The Reduced rules-based method ranks second, behind the Decision Tree Classifier using the SMOTE dataset. This may be due to the loss of information caused by the reduction in the number of rules.

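The significant improvement in Accuracy, Recall and Precision of the Decision Tree Classifier using the SMOTE data and of the Reduced rules method over the Decision Tree Classifier using the Actual dataset can be attributed to balancing the classes, since the original dataset is highly unbalanced.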

Since recall is an important parameter for identifying the target class, a high recall is valuable because it allows a given test sample to be classified more confidently. As the Reduced rules-based method gives the highest Recall for the Mild and Severe classes, it helps in detecting the presence of Anemia at the early stages, thereby helping the medical fraternity. The reduced decision rules would be handy for medical practitioners in detecting Mild class Anemia and prescribing suitable medication to delay further deterioration. This is the novel contribution of this work.

6 Conclusion and Direction for Future Work

Anemia detection is one of the challenging issues in the current scenario. To address this issue, we have taken an Anemia dataset from India. Due to the unbalanced nature of the dataset, we have used the SMOTE technique to balance the classes. A Decision Tree Classifier and a Reduced rule-based method have been used to detect the presence of Anemia in the given dataset. The Reduced rule-based method achieved significant results, especially for the Mild class, compared to the Decision Tree Classifier using the Actual and SMOTE datasets. Owing to its smaller number of rules, the Reduced rule-based method can also serve as a handy tool for the detection of Anemia. As future work, deep learning algorithms can be used for Anemia classification, and optimization-based methods may be used for rule reduction.