
1 Introduction

Decision making is a difficult task, and a single decision can depend on many factors. Among data mining techniques, decision tree algorithms are among the most widely used for prediction. The random forest algorithm is one such robust yet simple algorithm, based on the ensemble learning method.

Data mining is applied in medical science, astronomy, and other fields to extract information from data sets, which often have a large number of attributes and are complex to interpret.

One of the major causes of blindness in the world is cataract. Cataract blindness is preventable if the patient is operated on in time. Several organizations worldwide are working to spread the word about cataract and the surgeries performed for it. According to the World Health Organization, cataract is responsible for 51% of world blindness [15]. The World Health Organization defines the condition as follows: “Cataract is clouding of the lens of the eye which impedes the passage of light. Although most cases of cataract are related to the aging process, occasionally children can be born with the condition, or a cataract may develop after eye injuries, inflammation, and some other eye diseases” [15]. The statistics collected from many agencies and previous literature are serious enough to warrant a giant step towards preserving vision. Data on cataract patients needs to be studied and analyzed to reveal hidden trends that can in turn be used to create awareness among the general population.

In this paper, we have collected data from patients with eye problems, many of whom suffer from cataract. The collection also includes other patient details, such as dietary habits, addictions, and living environment, which may help predict the chances of developing cataract. Data mining algorithms carry out this assessment to assist in the decision-making process.

2 Literature Review

Data mining helps us see what is not directly visible but lies beneath the obvious. It finds the pearls of patterns and trends in oceans of data, analyzing information to find possible outputs [3]. The process by which hidden trends in data are identified, analyzed, and then categorized into useful knowledge is known as data mining [4]. It finds patterns or trends that are both interesting and useful, helps us see beyond the known, and ultimately allows one to decide upon facts and predict classes. Data mining can play a significant role in arranging data into different classes [6].

The decision tree algorithm splits the dataset repeatedly in a top-down fashion, and then horizontally at the same level, until all data items belonging to a class are identified [5]. A decision tree structure is made up of a root node, internal nodes, and leaf nodes. Most decision tree classifiers perform classification in two steps: first a tree is grown fully, and then it is shortened, or pruned. The tree is grown from the top and divided into branches until all class labels are identified. During pruning, the tree is cut wherever required to improve accuracy, beginning from the lowermost nodes [10].

A decision tree is like a flowchart in structure and layout: every internal node represents a condition on an attribute, each branch represents an outcome of that condition, and each leaf (terminal) node represents a class label. Classification rules are generated by following paths from the root node to the terminal nodes of the decision tree [2].
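The idea of reading one classification rule per root-to-leaf path can be seen directly in code. The paper's analysis was done in R; the following is only an illustrative Python/scikit-learn sketch on a standard dataset, not the authors' model:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree, then print its rules: each printed root-to-leaf
# path is one classification rule (condition, branch, class label).
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line of the output is an internal-node condition, and each `class:` line is a leaf, mirroring the flowchart description above.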

Classification algorithms learn in a supervised environment. They find and allocate class labels to data items by applying knowledge already acquired from records whose classes are known [1]. Classification techniques can solve problems in many fields, such as medicine, industry, business, and science. Basically, classification involves finding rules that partition the data into disjoint groups [14].

The objective of classification is to build a model from example cases, using some attributes to describe the objects or one attribute to describe the group of the objects. The model is then used to predict the group attribute of new cases from the domain based on the values of the other attributes [12].

Classification is the stepwise process of finding a set of models that describe and allocate data classes. The derived model is based on the analysis of a set of training data (i.e., data objects whose class labels are known) [13].

The random forest algorithm is a classifier consisting of a collection of trees, grown from independent, identically distributed random vectors, where each tree votes for the most accurate class [8]. At each step a new random vector is generated, independent of the previous random vectors but with the same distribution, and a tree is grown from it using the training set [9]. Random forest uses decision trees as its base classifiers. This ensemble learning method is used for classification and regression of data. An ensemble consists of a number of trained models whose predictions are combined to classify new instances.

Random forests are an effective tool for prediction. Because of the Law of Large Numbers, they do not overfit; injecting the right kind of randomness yields accurate classifiers and regressors [7]. Randomness can be introduced both in the selection of dimensions used to choose the splitting variable and in the choice of coefficients for random combinations of features [11].
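The ensemble scheme described above can be sketched in a few lines. The study itself used R's randomForest package; the snippet below is only a Python/scikit-learn analogue with synthetic data standing in for the patient records (the sample and feature counts merely echo those used later in the paper):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data standing in for the real patient records.
X, y = make_classification(n_samples=297, n_features=16, random_state=3)

# Each tree is grown on a bootstrap sample of the data; at every split
# only a random subset of predictors is considered (max_features,
# the counterpart of R's mtry parameter).
forest = RandomForestClassifier(n_estimators=500, max_features=12,
                                random_state=3)
forest.fit(X, y)

# The ensemble prediction combines the votes of the individual trees.
print(forest.predict(X[:5]))
```

The combination of bootstrap sampling and per-split feature sampling is exactly the "right amount of randomness" the literature above refers to.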

Nayer [18] researched diabetes mellitus detection using machine learning. The stacking ensemble method used in that work was built upon linear discriminant analysis, recursive trees, and KNN.

Beaulac and Rosenthal [19] studied ten years of undergraduate student data from a Canadian university using random forest. With it, they identified the variables most important to the classifier, revealing useful information for the university administration.

Sugandhi, Yasodha, and Kannan [20] used five classification algorithms for prediction of cataract: Naïve Bayes, SMO, J48, REP Tree, and Random Tree. The authors also reported the mean absolute error and the correctly classified instances for each algorithm, and found the random forest algorithm to be the most accurate classifier, with a prediction accuracy of 84.87%.

Niya [21] developed an automatic cataract detection methodology involving pre-processing, feature extraction, and classification. An SVM classifier was used for prediction of cataract, and a regression method was used for grading of cataract.

3 Data Collection and Research Instrument

This research work uses cataract patient data. The dataset is primary data collected through a questionnaire designed in consultation with ophthalmologists and taking into account the cataract risk factors specified on the World Health Organization website, which names smoking, diabetes mellitus, exposure to ultraviolet rays, and high body mass index among the cataract-causing parameters [15]. Keeping all factors in view, a total of 43 parameters were selected for data collection, covering personal details, food habits, medical and birth history, addictions, etc. The target location of the data collection was the Raigad district of Maharashtra, India. The questionnaire was prepared in English and Marathi and distributed among approximately 700 cataract patients. Because of low literacy levels, most respondents were not familiar with questionnaires, so assistance was provided for form filling. The data include people of both genders across different age groups, with a good mix of rural (including tribal) and urban population. Approximately 500 completed forms were received at camps and at the outpatient departments (OPD) of doctors. Only 297 forms were found complete and selected for analysis. Certain parameters in the questionnaire received no answers or very few entries; those attributes were removed from the dataset, and only 17 attributes were considered for the study.

From the dataset, the attribute ‘cataract’ is used as the class, and the other 16 variables are predictor variables. The dataset is analyzed in R, which provides ready-made packages for the random forest algorithm. The packages used in this study are “randomForest”, “dplyr”, “readxl” and “reprtree”. Table 1 lists each attribute name together with the symbolic name used in the code and in the tree visualizations.

Table 1. Attribute names and abbreviations

4 Importance of Attributes

One of the most useful outputs provided by random forest is the importance of each attribute. Table 2 lists the attributes with their importance for Class 1 and Class 2, together with the Mean Decrease Accuracy and Mean Decrease Gini. Mean Decrease Accuracy, also known as permutation importance, is computed by randomly permuting the values of a variable and measuring the resulting drop in accuracy.

Table 2. Importance of attributes

Mean Decrease Gini is also known as Gini importance. It measures how much each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest. Each time a particular variable is used to split a node, the Gini coefficient of the child nodes is calculated and compared to that of the parent node. The Gini coefficient is a measure of homogeneity, ranging from 0 (homogeneous) to 1 (heterogeneous). Attributes with a large mean decrease in accuracy are more important for classification of the data. In Table 2, the attribute age is the most important, with a Mean Decrease Accuracy of 29.2196387, followed by the attribute type of surgery at 17.944127, and so on. Similarly, Mean Decrease Gini is highest for the attribute age at 37.377705, followed by the attribute weight, and so on.
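Both importance measures can be reproduced with standard library calls. The authors used R's randomForest; the following scikit-learn sketch on synthetic data is only an analogue (scikit-learn's `feature_importances_` is the impurity-based counterpart of Mean Decrease Gini, and `permutation_importance` the counterpart of Mean Decrease Accuracy):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=297, n_features=16, random_state=3)
forest = RandomForestClassifier(n_estimators=500, random_state=3).fit(X, y)

# Gini importance: total impurity reduction contributed by each
# variable, averaged over all trees in the forest.
gini_importance = forest.feature_importances_

# Permutation importance: scramble one column at a time and measure
# how much the accuracy drops when that variable is uninformative.
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=3)
mda = perm.importances_mean

# Rank variables from most to least important, as in Table 2.
order = np.argsort(mda)[::-1]
```

Sorting by either measure yields a ranking like the one reported for age, type of surgery, and weight above.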

The variable importance plots are shown in Figs. 1 and 2. Each plot shows the variables on the y-axis and their importance on the x-axis, ordered from most important (top) to least important (bottom); the position of the dot on the x-axis gives an estimate of each variable's importance. The three least important variables were removed, but the OOB error estimate increased after their removal, so the random forest algorithm used all 16 variables for rule generation.

Fig. 1. Graph depiction of Mean Decrease Accuracy

Fig. 2. Graph depiction of Mean Decrease Gini

5 OOB Estimation Error

The out-of-bag (OOB) error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors randomly drawn for a split, referred to as mtry [16]. For each observation zi = (xi, yi), a random forest predictor is constructed by averaging only those trees grown on bootstrap samples in which zi did not appear [17]. In other words, the OOB estimate evaluates each tree on the samples left out during that tree's construction, and the error is estimated on those samples.
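The same estimate is available in most random forest libraries. As an illustrative scikit-learn analogue of the R workflow (synthetic data, not the paper's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=297, n_features=16, random_state=3)

# oob_score=True scores each observation using only the trees whose
# bootstrap sample did not contain it, mirroring the OOB estimate.
forest = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=3).fit(X, y)

oob_error = 1.0 - forest.oob_score_   # OOB error estimate
```

Because every observation is out-of-bag for roughly a third of the trees, no separate validation set is needed to obtain this error estimate.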

6 Analysis and Discussions

The random forest OOB error estimate depends on the seed value and on mtry. Table 3 shows the OOB error estimate at different seed values and mtry values. The random forest code was run in R for values of mtry ranging from 3 to 13 and set.seed values from 1 to 10; set.seed fixes the starting point of random number generation, and mtry is the tuning parameter. Thus, a total of 130 OOB estimates were recorded, as shown in Table 3.

Table 3. OOB error estimation at set.seed and mtry

Among the obtained OOB error estimates, the lowest value (30.98) occurs at set.seed value 3 and mtry value 12; this configuration was used for further study and rule generation. The R code generates 500 trees, selecting 12 candidate variables at each split. A total of 297 records were used to develop the model.
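The grid search over seed and mtry described above can be sketched as follows. This is a scikit-learn analogue on synthetic data, not the authors' R run, with a reduced tree count so the sketch runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=297, n_features=16, random_state=0)

# Record the OOB error for every (seed, mtry) pair, as in Table 3,
# then keep the combination with the lowest error.
results = {}
for seed in range(1, 11):            # set.seed values 1..10
    for mtry in range(3, 14):        # mtry values 3..13
        forest = RandomForestClassifier(n_estimators=50,
                                        max_features=mtry,
                                        oob_score=True,
                                        random_state=seed).fit(X, y)
        results[(seed, mtry)] = 1.0 - forest.oob_score_

best_seed, best_mtry = min(results, key=results.get)
```

The winning `(best_seed, best_mtry)` pair plays the role of the (3, 12) combination selected in the paper.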

Table 4 shows the confusion matrix. In total, 92 records are correctly classified into class 1 and 54 records are misclassified, while 113 records are correctly classified into class 2 and 38 records are misclassified. The classification error is 0.3698630 for class 1 and 0.2516556 for class 2.

Table 4. Confusion matrix

Various accuracy-related parameters calculated from the confusion matrix are given in Table 5 as diagnostic testing of accuracy. The classification accuracy of the model is reported as 69.0326%. Precision and recall are two further important accuracy indicators that together give a more detailed picture: precision reflects relevance, whereas recall reflects the completeness of the model. The precision of the model is 67.6646%, meaning that 67.6646% of positive identifications are actually correct, and the recall is 74.8344%, meaning that 74.8344% of actual positives are identified correctly. The misclassification rate of the model is reported as 28.9562%. Prevalence, used by epidemiologists and others, measures existing cases in the population, in contrast to incidence, which measures new cases; the point prevalence reported in Table 5 is 50.8417%, i.e., that percentage of people had the cataract condition at the time of data collection. The false positive rate of 36.986% indicates how often the condition is reported where it does not exist, while the true negative rate of 63.013% reported in Table 5 indicates how often the actual absence of the condition is correctly classified. The F score, the harmonic mean of precision and recall, is 71.0691%.
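All of these quantities follow from the four cells of the confusion matrix. A small sketch, taking class 2 (cataract-yes) as positive and using the counts reported in Table 4, closely reproduces the Table 5 figures up to rounding:

```python
def metrics(tp, fn, tn, fp):
    """Diagnostic-accuracy measures derived from a 2x2 confusion matrix."""
    total = tp + fn + tn + fp
    precision = tp / (tp + fp)        # relevance of positive calls
    recall = tp / (tp + fn)           # completeness / sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
        "prevalence": (tp + fn) / total,
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (fp + tn),
    }

# Counts from Table 4: 113 true positives, 38 false negatives,
# 92 true negatives, 54 false positives.
m = metrics(tp=113, fn=38, tn=92, fp=54)
```

For example, `m["precision"]` evaluates to 113/167 ≈ 0.6766 and `m["recall"]` to 113/151 ≈ 0.7483, matching the percentages above.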

Table 5. Parameters obtained from the values of confusion matrix

Table 6 shows the database of tree generation. Sr. no. gives the node number. The left daughter column indicates the node number of the left child of the splitting node, and the right daughter column indicates the node number of its right child. Split var gives the name of the variable used for splitting. The status column indicates whether the node is terminal: status 1 means the node is non-terminal, while status −1 means it is terminal and carries a class name. The prediction column shows the name of the class; <NA> in this column indicates that the node is not a leaf and has a left subtree, a right subtree, or both.

Table 6. Rules generated by Random Forest
Fig. 3. Random forest tree

Table 6 lists the rules generated by random forest. The first column is the node number; the second and third columns give the left and right children of the current node; the fourth column gives the code name of the splitting variable, as defined in Table 1. The fifth column is the split point, representing the threshold value: for continuous variables, values less than the threshold go to the left subtree and values greater than or equal to it go to the right subtree, while for categorical variables the respective category values are listed in the column. Column 6 is the status, indicating whether the current node is a leaf: status 1 represents a non-leaf node, whereas −1 represents a leaf node. The last column is the prediction, showing the class label; for a non-leaf node this column contains <NA>, meaning class identification is not required there. Figure 3 is the tree representation of the random-forest-generated rules.
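The left daughter / right daughter layout of Table 6 corresponds directly to how tree libraries store their rules internally. As an illustration (scikit-learn rather than R's getTree, so the field names differ, but the structure is the same):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=297, n_features=16, random_state=3)
tree = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X, y)

t = tree.tree_
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                      # leaf: analogous to status -1
        label = t.value[node].argmax()  # predicted class label
        print(f"node {node}: leaf, predicts class {label}")
    else:                               # internal: analogous to status 1
        print(f"node {node}: split on X[{t.feature[node]}] "
              f"<= {t.threshold[node]:.3f} -> left {left}, right {right}")
```

Each printed row mirrors one row of Table 6: node number, children, splitting variable, threshold, status, and prediction.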

7 Conclusion

The cataract condition develops in the lens of the eye. Ophthalmologists consider numerous factors, such as living habits, age, and gender, as well as medical conditions such as diabetes and cholesterol level, as causes of cataract. In consultation with ophthalmologists, primary data was collected and studied using the random forest algorithm. The algorithm gave the lowest OOB error estimate when set.seed was set to 3 and mtry was set to 12. From Table 2, it is concluded that the most important attribute for predicting cataract is age; the other factors, in order of importance, are shown in Figs. 1 and 2. The rules for predicting the presence of cataract in a patient are given in Table 6 and visualized in Fig. 3. From the confusion matrix shown in Table 4, it is concluded that classification is more accurate for cataract-yes (error 0.2516556) and less accurate for cataract-no (error 0.3698630). Variable importance and tree paths are useful for predicting the possibility of cataract in individuals.