1 Introduction

Diabetes Mellitus (DM), commonly called diabetes, is a chronic disease characterized by elevated blood glucose levels (hyperglycemia). It is classified into two types: Type 1 Diabetes Mellitus (T1DM) and Type 2 Diabetes Mellitus (T2DM). In T1DM, blood glucose rises because the pancreas secretes too little insulin; the condition is manageable with an ongoing supply of insulin and blood glucose testing equipment. T2DM occurs when the body's cells become resistant to insulin, and it increases the risk of Cardiovascular Disease (CVD). T2DM accounts for the vast majority of diabetes cases, despite being largely preventable. The International Diabetes Federation (IDF) (International diabetes federation (IDF) diabetes atlas 2017) reports that, as of 2017, approximately 425 million people aged 20–79 are estimated to have diabetes worldwide. According to the IDF, if current trends continue, the number of people aged 20–79 with diabetes will reach 629 million by 2045; in general, one out of every ten people will be affected (International diabetes federation (IDF) diabetes atlas 2017). To help manage this impact, this paper proposes a Transparent Diabetes Management System with Machine Learning (TDMSML), which can assist doctors and the general public in managing diabetes.

With the advent of the internet and efficient communication, the medical science industry has amassed a massive amount of relevant and valuable data that has not been properly mined and organized for optimal use. The hidden patterns and relationships in these data sets frequently remain undiscovered and underutilized. Fortunately, many data mining techniques, such as Artificial Neural Network (ANN), Support Vector Machine (SVM), Decision Tree (DT), and others, are available to extract useful knowledge from data. The majority of these techniques, however, are black boxes and cannot explain the decisions they make. In contrast, Decision Tree (DT) (Sankaranarayanan 2014) is a transparent data mining technique that identifies the major constituents of a decision and explicitly explains how a decision is made by generating user-understandable decision rules. As a result, the proposed Transparent Diabetes Management System (TDMSML) employs a Decision Tree (Sankaranarayanan 2014) to identify the key factors causing diabetes.

The decision tree algorithm is first used to generate the rule-set. The transparent and significant rules are then chosen from the extracted rule-set, and the redundant rules are pruned to yield a minimized rule-set using the proposed Sequential Transparent Floating Forward–Rule Pruning Algorithm (STFF–RPA). The STFF–RPA algorithm is designed on the basis of the Sequential Floating Forward Search (SFFS) (Lv et al. 2015) technique. However, it has been observed that the SFFS technique sometimes fails to detect subsets of rules that could outperform the current subset. To address this issue, the rule extraction step of STFF–RPA, the Sequential Deep Transparent Floating Forward Search (SDTFFS) algorithm, incorporates a step known as ‘Deep Transparency Search’ (DTS) that allows these potential subsets to be found. During the rule selection step, the criterion for selecting a rule is the value of the Significance of Rule (SOR), which is calculated using the two formulas shown in Eqs. 6 and 7. The transparent rule-set obtained in this way may still contain some redundant rules, which make it harder to understand the major attributes/causes of a decision; removing them increases transparency further. Therefore, in the subsequent stages, the proposed STFF–RPA prunes the redundant and irrelevant rules and merges the pruned rule-set into a single rule. The range of each attribute in the merged rule is then reversed, and the misclassification rate is calculated for each attribute individually. The attributes with the highest misclassification rates after reversal are considered the major factor(s)/attribute(s), because a high misclassification rate means that the attribute and its calculated data range are very important for accurately diagnosing the disease. As a result, these factor(s)/attribute(s), along with their calculated ranges, are regarded as major cause(s) of diabetes.
Diabetes can be managed/controlled to a large extent if the data ranges of the major factor(s)/attribute(s) are controlled, with proper medication as per experts’ advice, food habits, or exercise.

The paper is organized as follows: Sect. 2 reviews the existing literature on diabetes diagnosis and prediction using DT, Sect. 3 describes the proposed TDMSML system in detail, Sect. 4 presents the results of diabetes management using the Pima Indian Diabetes Data Set collected from Kaggle, and Sect. 5 draws conclusions.

2 Literature survey

In recent years, DTs have been widely used by researchers to predict and diagnose health-related problems. Various DT algorithms are available that perform exceptionally well in diagnosing diseases such as diabetes, breast cancer, heart disease and many more (Podgorelec et al. 2002; Stiglic et al. 2012). DT is more prevalent than other classification techniques, viz. Artificial Neural Network, Naïve Bayes classifier, SVM etc., because its transparent nature helps in identifying the major attributes responsible for a disease and in validating the outcomes (Podgorelec et al. 2005; Rajkumar and Reena 2010; Dangare and Apte 2012; Kokol et al. 1994; Azar and Bitar 2015; Zorman et al. 1997). Apart from being transparent, a DT extracts the range of values of the major attributes responsible for diagnosing a disease. Many interesting models using decision trees for medical diagnosis have been proposed so far; some of the relevant implementations are mentioned below:

Biswas et al. (2018) proposed an algorithm called Rule-Based Major Feature Identification (RBMFI), which extracts the most important factors responsible for a disease by pruning the production rules generated by a DT. Azar et al. (2018) made a comparative study of nine machine learning techniques for the classification of eight different diseases using two modeling techniques: cross-validation and bootstrapping. Purushottam et al. (2015) designed a system that efficiently discovers rules to predict the risk level of patients based on given parameters about their health. The results of the proposed system were then compared, using C4.5 rules and partial tree, in terms of different parameters. Panigrahi et al. (2016) used different classification approaches, namely a DT algorithm, a Bayes algorithm and a rule-based algorithm, and evaluated them on the basis of error rates. Shen et al. (2010) proposed an attribute reduction algorithm in which the irrelevant condition attributes are pruned. After pruning the irrelevant attributes, this model evaluates the different subsets of candidate attributes using a derived fitness function. Liu et al. (1970) designed a new rule pruning technique that removes unwanted terms from a rule. For rules that hold multiple test conditions on an attribute, the technique examines the upper and lower bounds individually as well as together and retains the attribute with the higher fitness value. Vijayan et al. (2015) provided a review highlighting the benefits of different preprocessing techniques for diabetes prediction decision support systems based on Support Vector Machine (SVM), Naïve Bayes classifier and Decision Tree. Shetty et al. (2017) proposed a model for building an Intelligent Diabetes Disease Prediction System (IDDPS) that analyzes diabetes using a database of diabetes patients.
In this system, they used Bayesian and K-NN (K-Nearest Neighbor) algorithms and analyzed them by taking various attributes of diabetes for the prediction of the disease. Morteza et al. (2015) proposed an RF + HC method in which a Random Forest is used for rule extraction. Once the random forest is built, hill climbing algorithms are used to search for an efficient set of rules, which reduces the number of rules significantly. Morteza et al. (2017) proposed three algorithms to extract important rules from decision tree ensembles. All three algorithms, viz. RF + DHC, RF + SGL and RF + MSGL, use a random forest to generate the decision rules, but they differ in how the significant rule-set is selected: RF + DHC uses a downhill climbing algorithm for rule selection, whereas RF + SGL uses sparse group lasso and RF + MSGL uses multi-class sparse group lasso. Pradeep et al. (2016) proposed a methodology for diabetes detection comprising two steps, feature extraction and classification by the J48 decision tree algorithm, along with an online web-based remote patient monitoring application. The International Diabetes Federation (IDF) diabetes atlas, 8th edition (International diabetes federation (IDF) diabetes atlas 2017), provided very useful statistics on diabetes, the rate at which it is increasing and how humanity is going to be affected. Guo et al. (2012) aimed at the discovery of a decision tree model for the diagnosis of type-2 diabetes; this model uses some pre-processing techniques followed by Naïve Bayes model construction. Bashir et al. (2014) used decision trees that differ in splitting criterion, namely ID3, C4.5 and CART, as base classifiers; these decision trees are then combined using different ensemble techniques. Huang et al. (2016) used a Support Vector Machine (SVM) to classify the given data step by step.
Incorrectly classified patterns were fed to the succeeding stage to find a better split point in the SVM. The split point was used to calculate the information gain, which can identify principal features among the candidate attributes. Tsipouras et al. (2006) presented a decision support system using a fuzzy model and optimization of the parameters obtained from a decision tree using the fuzzy model. Chen et al. (2017) proposed a hybrid prediction model in which K-means was used for data reduction, with the J48 decision tree as a classifier, to help the diagnosis of Type 2 diabetes. Tanner et al. (2008) reported that decision algorithms can predict and diagnose dengue disease using simple clinical and hematological parameters. Their model used the C4.5 decision tree algorithm along with a parameter called minimal cases, which acts as the stopping criterion for further partitioning of the data at a particular decision node. Various decision trees were generated using this parameter, and on the basis of performance the corresponding tree was chosen for further analysis of sensitivity and specificity. Liu et al. (2016) designed a modified version of the C4.5 DT algorithm based on the RELIEFF technique, which is used for attribute weighting in disease diagnosis; here, the RELIEFF method is used to prune the redundant attributes. Albu (2017) used a DT algorithm for hepatitis prediction, presenting a simple automated system that diagnoses hepatitis. Pashaei et al. (2015) proposed a model that uses the C4.5 decision tree algorithm with a boosted version of the C5.0 DT algorithm as the fitness function to improve the outcome of Particle Swarm Optimization (PSO) for medical analysis. Amiri et al. (2013) developed a medical diagnostic system for heart disease detection and identification using Classification and Regression Trees (CART).
They demonstrated the potential of CART and a suitable encoding scheme for classifying heart sound data as innocent or pathological murmurs in newborns. The output of their system can help physicians decide whether or not to send a newborn for an echocardiogram. Saranya et al. (2017) proposed a system that uses cluster and decision tree analytics-based supervised learning and forms a hub between patients, doctors and dieticians by providing a mobile and web application-based solution to pregnant women on various health-related issues, diet tips etc. A corpus of responses collected from physicians and dietitians is used for creating the system. Sankaranarayanan et al. (2014) applied classification methods to a careful study of diabetes. The data set contained 768 instances with 8 continuous attributes and a two-class decision attribute that indicates whether or not a person has diabetes mellitus. Ronald et al. (2018) used the basic concepts of the SFFS algorithm for sub-group feature identification. Nakariyakul et al. (2009) proposed an improved version of sequential forward floating search named Improved Forward Floating Search (IFFS). Jia et al. (2015) proposed a sequential deep floating forward search algorithm, which uses the concept of Sequential Floating Forward Search (SFFS) with a minor modification that adds an extra deep search step. Fazil et al. (2017) focused on finding the linkage between blood glucose and cholesterol levels in pre-diabetic subjects in order to explain the cause of diabetes and cardiovascular disease. Wei et al. (2018) made a comprehensive exploration of the most popular classification techniques (Deep Neural Network (DNN), Support Vector Machine (SVM), Logistic Regression, Decision Tree, and Naïve Bayes) used to identify diabetes.
Sumangali et al. (2016) focused on classifying the data into diabetic and non-diabetic class labels and improved the classification accuracy by combining CART and RF, which increases accuracy and resolves the problem of overfitting. Shivakumar et al. (2014) provided a survey of data mining methods, viz. association rules, clustering and classification, that have commonly been applied to the Pima Indian diabetes data set for diabetes data analysis and prediction of the disease. A method for computer-assisted diagnosis of skin cancers in dermatology was put forward by Handels et al. (1999). Malignant melanoma and nevocytic nevi (moles) were automatically recognized through the analysis of high-resolution skin surface profiles. Salem et al. (2018) proposed a two-phase technique to classify images of lesions as benign or malignant. The first phase consists of an image processing-based method that extracts the Asymmetry, Border Irregularity, Color Variation and Diameter of a given mole. The second phase classifies lesions using a Genetic Algorithm. Podgorelec et al. (2001) introduced the integrated computerized environment DIAPRO, which enables optimization of the diagnostic process and is based on a single approach: evolutionary algorithms.

3 Proposed TDMSML model

This section describes the proposed Transparent Diabetes Management System with Machine Learning (TDMSML) model, which identifies preventive measures for Diabetes disease by identifying the most significant factors with ranges using a Decision tree. The proposed TDMSML is divided into three phases: rule generation, transparent rule selection, and identification of major factors. The decision tree is used to generate a rule-set, and the Transparent Rule Selection stage selects the transparent rules from the rules generated by the decision tree, followed by Major Factor Identification, which identifies the significant factors responsible for diabetes disease. Figure 1 depicts a schematic representation of the proposed TDMSML model.

Fig. 1 Transparent disease management system using machine learning

3.1 Rule generation

The TDMSML model generates production rules using the C4.5 decision tree algorithm. To build the decision tree during the training stage, C4.5 employs a top-down strategy based on the divide-and-conquer approach. It uses the information gain ratio as the metric for selecting splitting attributes and generates nodes from the root to the leaves. Every path from the root node to a leaf node constitutes a decision rule for determining which class a new instance belongs to. The root node contains the entire training set, with all training case weights set to 1.0 to account for unknown attribute values. If all of the current node’s training cases belong to the same class, the algorithm terminates at that node. Otherwise, if the training cases belong to more than one class, the algorithm computes the information gain ratio for each attribute, and the attribute with the highest information gain ratio is selected as the best attribute for splitting. Equation (1) gives the gain ratio used to split a node on an attribute A, where D is the set of samples (partitioned by A into \(D_{1}, \ldots, D_{n}\)) with 'm' distinct classes.

$$ {\text{Gain Ratio}}\left( A \right) = \frac{{\text{Gain}}\left( A \right)}{{\text{Split-Info}}_{A} \left( D \right)} $$
(1)
$$ {\text{Gain}}\left( A \right) = {\text{Info}}\left( D \right) - {\text{Info}}_{A} \left( D \right) $$
(2)
$$ {\text{Split-Info}}_{A} \left( D \right) = - \sum_{i = 1}^{n} \frac{\left| D_{i} \right|}{\left| D \right|} \log_{2} \left( \frac{\left| D_{i} \right|}{\left| D \right|} \right) $$
(3)
$$ {\text{Info}}\left( D \right) = - \sum_{i = 1}^{m} {\text{Prob}}_{i} \log_{2} \left( {\text{Prob}}_{i} \right) $$
(4)
$$ {\text{Info}}_{A} \left( D \right) = \sum_{i = 1}^{n} \frac{\left| D_{i} \right|}{\left| D \right|} \, {\text{Info}}\left( D_{i} \right) $$
(5)

Gain(A) in (2) represents the expected reduction in entropy caused by knowing the value of attribute A, and \({\text{Split-Info}}_{A} \left( D \right)\) in (3) represents the potential information obtained by partitioning the training data set D into n partitions on attribute A. Info(D) in (4) measures the class impurity before splitting the data set D, whereas \({\text{Info}}_{A} \left( D \right)\) in (5) specifies the average entropy after D is split on the attribute values of A. Here \({\text{Prob}}_{i}\) is the probability of the distinct class \(C_{i}\), and \(|D_{i}|/|D|\) acts as the weight of the ith partition (Navada et al. 2011).
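As a rough illustration of Eqs. (1)–(5), the following Python sketch computes the gain ratio of a single attribute. It is not the C4.5 implementation used in this work: it assumes a categorical attribute (or a continuous one that has already been turned into a threshold test), and the function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Info(D), Eq. (4): class entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Gain Ratio(A), Eq. (1), for one attribute A.

    values[i] is the value of A for sample i and labels[i] is its class.
    """
    n = len(labels)
    partitions: dict = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)  # D_i: samples sharing one value of A

    weights: List[float] = [len(p) / n for p in partitions.values()]
    # Info_A(D), Eq. (5): weighted average entropy after the split
    info_a = sum(w * entropy(p) for w, p in zip(weights, partitions.values()))
    gain = entropy(labels) - info_a                      # Eq. (2)
    split_info = -sum(w * math.log2(w) for w in weights)  # Eq. (3)
    # Assumption: an attribute with a single value (Split-Info = 0) gets ratio 0.
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: the attribute separates the two classes perfectly.
toy_values = ["high", "high", "low", "low"]
toy_labels = [1, 1, 0, 0]
print(gain_ratio(toy_values, toy_labels))
```

The guard for a zero Split-Info is a choice made for this sketch; Eq. (1) itself is undefined in that case.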

3.2 Transparent rule selection

To obtain a transparent and minimized rule-set, the TDMSML model proposes the Sequential Transparent Floating Forward–Rule Pruning Algorithm (STFF–RPA). The proposed algorithm takes decision rules as input and outputs significant ranges identifying the major factor(s)/attribute(s) responsible for diabetes disease. STFF–RPA is divided into two stages: transparent rule extraction and rule pruning. It extracts the most transparent rule-set from a set of decision rules in the transparent rule extraction phase, whereas the rule pruning phase determines the best rules from each training set in terms of highest SOR value, followed by pruning the unnecessary rules and merging the refined rules into a single rule. The following are the stages of the proposed STFF–RPA:

3.2.1 Transparent rule extraction

The Sequential Deep Transparent Floating Forward Search (SDTFFS) algorithm is proposed for extracting transparent rules. SDTFFS has three steps: Sequential Forward Search (SFS), Deep Transparency Search (DTS), and Sequential Backward Search (SBS). In all three steps, the criterion for deciding whether or not to select a rule is the Significance of Rule (SOR). Equations 6 and 7 give the formulas for calculating the SOR value:

$$ AOR = \left[ { \left( {\frac{CC - IC}{{CC + IC}}} \right) + \left( {\frac{CC}{{IC + 1}}} \right) - \frac{IC}{{CC}} } \right] $$
(6)
$$ SOR = AOR + \frac{CC}{{RL}} $$
(7)

where AOR = Accuracy of the Rule, SOR = Significance of Rule, CC = Total number of correctly classified patterns by a rule-set, IC = Total number of incorrectly classified patterns by a rule-set and RL = Rule Length. The process flow of the SDTFFS is shown in Fig. 2.
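Equations 6 and 7 can be evaluated directly from the three counts. The sketch below is a minimal Python illustration rather than the authors' implementation; the function names and the example counts (CC = 80, IC = 20, RL = 2) are hypothetical. Equation 6 is undefined when CC = 0, and raising an error in that case is an assumption of this sketch.

```python
def aor(cc: int, ic: int) -> float:
    """Accuracy of the Rule, Eq. (6), from correct (CC) and incorrect (IC) counts."""
    if cc <= 0:
        # Assumption: treat CC = 0 as an error because the IC/CC term is undefined.
        raise ValueError("Eq. (6) is undefined when CC = 0")
    return (cc - ic) / (cc + ic) + cc / (ic + 1) - ic / cc


def sor(cc: int, ic: int, rl: int) -> float:
    """Significance of Rule, Eq. (7): AOR plus CC divided by the rule length RL."""
    if rl <= 0:
        raise ValueError("rule length RL must be positive")
    return aor(cc, ic) + cc / rl


# Hypothetical counts, chosen only to show the call.
example_sor = sor(cc=80, ic=20, rl=2)
print(round(example_sor, 4))
```

Because CC/RL grows as the rule gets shorter, Eq. 7 favours short rules among rules of similar accuracy, which is in line with the stated goal of transparency.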

Fig. 2 Sequential deep transparent floating forward search (SDTFFS)

The detailed explanation of the stages of proposed SDTFFS algorithm is given below:

3.2.1.1 Sequential forward search (SFS)

Starting with an empty rule-set, the algorithm tries to add a new rule to the current rule-set in each iteration. Before adding a rule, it checks whether the addition increases the overall SOR value of the current rule-set. If it does, the rule is added; otherwise it is discarded. If no remaining rule can increase the SOR value of the current rule-set, the proposed SDTFFS algorithm terminates and the current rule-set is the resultant rule-set of this stage.
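The forward step can be sketched as follows. This is an illustrative Python outline of the SFS step in isolation, not the authors' code: `score` stands in for the SOR of a rule-set, the toy scoring function is hypothetical, and picking the rule that raises the score most (rather than the first one that raises it) is an assumption, since the text does not fix the order in which rules are tried.

```python
from typing import Callable, FrozenSet, Iterable, Optional

RuleSet = FrozenSet[str]


def forward_step(current: RuleSet,
                 pool: Iterable[str],
                 score: Callable[[RuleSet], float]) -> Optional[RuleSet]:
    """Add the one rule from `pool` that raises the score most.

    Returns the enlarged rule-set, or None when no rule improves the score
    (the termination condition described in the text).
    """
    best, best_score = None, score(current)
    for rule in pool:
        if rule in current:
            continue
        candidate = current | {rule}
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best


# Hypothetical toy score: per-rule benefit minus a penalty on rule-set size.
benefit = {"r1": 5.0, "r2": 3.0, "r3": 1.0}


def toy_score(rules: RuleSet) -> float:
    return sum(benefit[r] for r in rules) - 2.0 * len(rules)


selected: RuleSet = frozenset()
while (enlarged := forward_step(selected, benefit, toy_score)) is not None:
    selected = enlarged
print(sorted(selected))
```

In the full SDTFFS procedure the DTS and SBS steps run between forward steps; the loop here only shows when the forward search stops.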

3.2.1.2 Deep transparency search (DTS)

When the proposed SDTFFS algorithm enters the DTS stage, the rule most recently added by the SFS step is marked as the seed, and the current rule-set (S) is assumed to have size k. In each iteration, DTS creates rule-sets of size (k–1) by removing one rule at a time from S while keeping the seed intact, which produces \(N = {}^{k-1}C_{k-2} = k-1\) new rule-sets of size (k–1). To each of these N rule-sets it adds the rule with the highest SOR value from the remaining rules that are not present in S, so that N new k-rule-sets (rule-sets of size k) are obtained. If the best of these N k-rule-sets has a SOR value greater than the previous SOR value of S, it becomes the new current rule-set (S); otherwise S is kept as it is.
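One DTS pass can be outlined as below. This Python sketch is an interpretation, not the authors' algorithm listing: `score` again stands in for the SOR of a rule-set, the toy benefits are hypothetical, and "the rule having the highest SOR value" is read here as the outside rule that gives the highest score to the resulting k-rule-set.

```python
from typing import Callable, FrozenSet, Iterable

RuleSet = FrozenSet[str]


def deep_transparency_search(current: RuleSet,
                             seed: str,
                             pool: Iterable[str],
                             score: Callable[[RuleSet], float]) -> RuleSet:
    """One DTS pass: swap one non-seed rule for the best rule outside `current`."""
    outside = [r for r in pool if r not in current]
    if not outside:
        return current
    best, best_score = current, score(current)
    for dropped in current - {seed}:       # k-1 choices; the seed always stays
        reduced = current - {dropped}      # a (k-1)-rule-set
        # Assumption: add the outside rule that maximises the score of the new set.
        candidate = max((reduced | {r} for r in outside), key=score)
        s = score(candidate)
        if s > best_score:                 # accept only a strict improvement over S
            best, best_score = candidate, s
    return best


# Hypothetical toy score: the sum of per-rule benefits.
toy_benefit = {"r1": 5.0, "r2": 1.0, "r3": 4.0}


def toy_sum(rules: RuleSet) -> float:
    return sum(toy_benefit[r] for r in rules)


start = frozenset({"r1", "r2"})
swapped = deep_transparency_search(start, "r1", toy_benefit, toy_sum)
print(sorted(swapped))
```

With the seed fixed, a k-rule-set yields exactly k−1 reduced sets, which matches the count N given in the text.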

3.2.1.3 Sequential backward search (SBS)

The k-rule-set (rule-set of size k) obtained from the DTS step is fed as input to the SBS step. In this step the algorithm removes one rule at a time from the current k-rule-set to obtain all (k–1)-rule-sets (rule-sets of size k–1) and calculates their SOR values. The rule-set with the highest SOR value, provided that this value is also greater than the SOR value of the (k–1)-rule-set of the previous iteration, is taken as the current best (k–1)-rule-set and forwarded to the next iteration.
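The backward step can be sketched in the same style. The Python outline below is illustrative only: `score` stands in for the SOR of a rule-set, `best_by_size` is a hypothetical bookkeeping dictionary holding the best score seen so far for each rule-set size, and the coverage-based toy score is invented for the example.

```python
from typing import Callable, Dict, FrozenSet, Optional

RuleSet = FrozenSet[str]


def backward_step(current: RuleSet,
                  score: Callable[[RuleSet], float],
                  best_by_size: Dict[int, float]) -> Optional[RuleSet]:
    """Return the best (k-1)-subset if it beats the best (k-1)-rule-set seen so far."""
    if len(current) < 2:
        return None
    subsets = [current - {rule} for rule in current]   # all (k-1)-rule-sets
    candidate = max(subsets, key=score)
    candidate_score = score(candidate)
    if candidate_score > best_by_size.get(len(candidate), float("-inf")):
        best_by_size[len(candidate)] = candidate_score
        return candidate
    return None


# Hypothetical toy score: patterns covered minus a penalty per rule.
coverage = {"r1": {1, 2, 3}, "r2": {3}, "r3": {4, 5}}


def coverage_score(rules: RuleSet) -> float:
    covered = set().union(*(coverage[r] for r in rules)) if rules else set()
    return len(covered) - 0.5 * len(rules)


history = {2: 2.0}   # best 2-rule-set score recorded in an earlier iteration
shrunk = backward_step(frozenset(coverage), coverage_score, history)
print(sorted(shrunk), history[2])
```

Here the redundant rule r2 is dropped because the remaining pair covers the same patterns with one rule fewer.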

The summarized SDTFFS algorithm is explained below:

figure a
figure b

3.2.2 Rule pruning phase

The proposed STFF–RPA prunes the transparent rule-set to obtain a single optimal rule. The rule pruning phase of STFF–RPA is divided into three stages: rule selection, rule pruning, and rule merging. In the rule selection stage, the most promising transparent rule is selected from each training set (tenfold cross-validation) by calculating the SOR. Then, in the rule pruning stage, redundant rules are pruned to make the proposed TDMSML more transparent. Finally, the pruned rules are merged into a single rule. Each step is explained in detail below:

3.2.2.1 Rule selection

From the transparent rule-set obtained for each training set, the rule that produces the highest SOR value is identified and considered the final and most promising rule for that training set. This selection procedure is repeated for all the training sets. The algorithmic representation is given below:

figure c
3.2.2.2 Rule pruning

In the rule pruning process, pruning is based on a sensitivity analysis of each rule obtained from the rule selection phase. First, if a single rule contains different ranges of the same attribute, the sensitivity of each range is calculated for that rule only. A range whose removal lowers the sensitivity of the rule is considered an important range. After the ranges have been pruned, the sensitivity is calculated for each rule. The rules whose removal lowers the sensitivity of the rule-set are considered more significant and are kept unchanged; the remaining, redundant rules are removed from the current rule-set. The rule pruning algorithm is summarized below:

figure d
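The rule-level part of this pruning can be illustrated with a short Python sketch. It is an interpretation under stated assumptions, not the authors' listing: a rule is modelled as a hypothetical predicate over a pattern, the rule-set predicts the positive class when any rule fires, sensitivity is the fraction of positive patterns covered, and rules are tried for removal in their given order.

```python
from typing import Callable, List, Sequence

Pattern = Sequence[float]
Rule = Callable[[Pattern], bool]


def sensitivity(rules: List[Rule], X: Sequence[Pattern], y: Sequence[int]) -> float:
    """Fraction of positive patterns (label 1) covered by at least one rule."""
    positives = [x for x, label in zip(X, y) if label == 1]
    hits = sum(any(rule(x) for rule in rules) for x in positives)
    return hits / len(positives)


def prune_rules(rules: List[Rule], X: Sequence[Pattern], y: Sequence[int]) -> List[Rule]:
    """Drop every rule whose removal does not lower the sensitivity of the rule-set."""
    kept = list(rules)
    for rule in list(rules):
        trial = [r for r in kept if r is not rule]
        if trial and sensitivity(trial, X, y) >= sensitivity(kept, X, y):
            kept = trial          # the rule was redundant
    return kept


# Hypothetical normalised patterns (two attributes) and three toy rules.
X_toy = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.9], [0.1, 0.1]]
y_toy = [1, 1, 1, 0]


def rule_a(x: Pattern) -> bool:
    return x[0] > 0.5


def rule_b(x: Pattern) -> bool:   # covers a subset of what rule_a covers
    return x[0] > 0.85


def rule_c(x: Pattern) -> bool:
    return x[1] > 0.8


pruned = prune_rules([rule_a, rule_b, rule_c], X_toy, y_toy)
print([r.__name__ for r in pruned])
```

The sketch covers only the removal of whole rules; pruning individual ranges inside a rule would need an explicit representation of the rule's conditions.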
3.2.2.3 Rule merging

The significant rules in the rule-set are further refined by focusing on their attributes. Attributes that appear with different ranges in more than one rule are identified, and these ranges are pruned one by one if their absence from the rule-set increases its accuracy. The final rule is generated by merging all the refined significant rules in the set into a single rule. An outline of the algorithm is given below:

figure e

3.3 Major factor identification

To identify the major factor(s) from the pruned and merged (single) rule, the Sequential Search algorithm is used. The basic concept of this algorithm is to evaluate each attribute independently. In this stage, each attribute present in the merged rule is considered in turn while the other attributes are kept inactive, and the prevention/misclassification rate is calculated. The attribute that produces the highest average misclassification rate for the positive class when its range is reversed is considered the major factor for diabetes. If more than one major factor needs to be identified, the incremental Sequential Search process is executed iteratively to identify the intended number of major factor(s) with range(s). The flow chart of the algorithm is shown in Fig. 3, and an outline of the algorithm is given below:

Fig. 3 Major factor identification

figure f
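The single-attribute case of this search can be sketched in Python as follows. The sketch is one possible reading of the description, not the authors' algorithm: the merged rule is modelled as a hypothetical mapping from attribute index to a closed range, "keeping other attributes inactive" is taken to mean ignoring them, and a positive pattern counts as misclassified when it falls inside the original range and therefore fails the reversed condition. The toy data are invented.

```python
from typing import Dict, List, Sequence, Tuple

Range = Tuple[float, float]   # closed interval [low, high] from the merged rule


def misclassification_when_reversed(attr: int, rng: Range,
                                    X: Sequence[Sequence[float]],
                                    y: Sequence[int]) -> float:
    """Share of positive patterns missed once the range of `attr` is reversed."""
    low, high = rng
    positives = [x for x, label in zip(X, y) if label == 1]
    # The reversed condition fires only outside [low, high], so positive
    # patterns inside the original range are no longer recognised.
    missed = sum(low <= x[attr] <= high for x in positives)
    return missed / len(positives)


def rank_factors(merged_rule: Dict[int, Range],
                 X: Sequence[Sequence[float]],
                 y: Sequence[int]) -> List[Tuple[int, float]]:
    """Attributes sorted by misclassification rate, most important first."""
    rates = {a: misclassification_when_reversed(a, r, X, y)
             for a, r in merged_rule.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalised data with two attributes (indices 0 and 1).
X_demo = [[0.9, 0.6], [0.8, 0.7], [0.4, 0.8], [0.2, 0.1]]
y_demo = [1, 1, 1, 0]
merged_demo = {0: (0.5, 1.0), 1: (0.5, 1.0)}

ranking = rank_factors(merged_demo, X_demo, y_demo)
print(ranking[0][0])   # index of the attribute ranked as the major factor
```

Selecting several major factors would repeat the same evaluation on combinations of attributes, as described for the incremental search.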

4 Results and discussion

The Pima Indian Diabetes data set is used here, and this section explains the entire TDMSML workflow together with its results. The data set contains 768 patterns, 8 attributes and 2 classes indicating the occurrence and non-occurrence of diabetes. Tenfold cross-validation is performed to validate the model. The data set comprises the following attributes:

  • x1 \(\to\) Pregnancies—Number of times pregnant

  • x2 \(\to\) Glucose—Plasma glucose concentration at 2 h in an oral glucose tolerance test

  • x3 \(\to\) Blood Pressure—Diastolic blood pressure (mm Hg)

  • x4 \(\to\) Skin Thickness—Triceps skin fold thickness (mm)

  • x5 \(\to\) Insulin—2-h serum insulin (mu U/ml)

  • x6 \(\to\) BMI—Body mass index (weight in kg/(height in m)^2)

  • x7 \(\to\) Diabetes pedigree function

  • x8 \(\to\) Age (years)
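The tenfold cross-validation used to validate the model splits the 768 patterns into ten training/test partitions. The text does not state how the folds were formed, so the Python sketch below assumes contiguous, unshuffled folds purely for illustration; `kfold_indices` is a hypothetical helper and not part of the TDMSML description.

```python
from typing import List, Tuple


def kfold_indices(n_samples: int, n_folds: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into (train, test) index lists, one pair per fold."""
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)   # first `extra` folds get one more
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = kfold_indices(768)
print([len(test) for _, test in folds])
```

Each of the ten training sets produced this way corresponds to one "training set" in the rule generation and rule selection steps.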

4.1 Rule generation

The set of rules generated for the positive class (patients diagnosed with diabetes) of training set 1 is shown below. The rules are generated by the C4.5 decision tree algorithm.

figure g

4.2 Transparent rule selection

As discussed earlier, this stage finds the minimized transparent rule-set by extracting a transparent rule-set, followed by rule pruning (to remove the redundant rules) and rule merging (to merge the rules into a single rule). Each step is explained below along with its results:

4.2.1 Transparent rule extraction

In this stage, the transparent rule-set is extracted from the set of rules generated by the decision tree. For training set 1, rule number 3 and rule number 6 (shown in the box above) are selected as the transparent rule-set by the proposed SDTFFS algorithm. The resultant transparent rule-set is shown in the box given below:

figure h

4.2.2 Rule pruning phase

Once the transparent rule-set is obtained, the proposed STFF–RPA selects the most promising rule and then removes the redundant rules, which may increase the transparency as well as the classification accuracy. Rule pruning is done in the following manner.

4.2.2.1 Rule selection

The transparent rule-set extracted in Sect. 4.2.1 contains two rules. Of these, the rule with the higher SOR value is selected; in this case it is the second rule. Therefore, the rule given in the box below is considered the most promising and transparent rule among all the rules in training set 1.

figure i

Similarly, the most promising rule is selected from each of the remaining training sets. As a result, a total of 10 rules are selected from the 10 training sets. The final transparent rule-set is given in the box below:

figure j
4.2.2.2 Rule pruning

The set of 10 rules obtained in Sect. 4.2.2.1 contains some repeated rules, and in some of the rules the same attribute has different ranges. These redundant rules and ranges need to be removed to increase the transparency of the set. The sensitivity of each range of an attribute that has different ranges in a rule is calculated individually, keeping all other rules inactive, and the range that is less sensitive in making the decision is dropped from the rule. For example, the first rule of the set has attribute x2 with two different ranges. With these two ranges in the rule, 81 positive patterns out of a total of 268 positive patterns were correctly classified, and without the range (x2 >= 0.791457), the number of correctly classified positive patterns increased to 151. As a result, the range (x2 >= 0.791457) is removed from rule 1 of the set. This procedure is repeated for all of the rules in the rule-set. After removing all such ambiguous ranges of a single attribute in a rule, the sensitivity of each rule is calculated. If the number of correctly classified positive patterns increases when a rule is removed, that rule is permanently dropped. The pruned rule-set is shown in the box below:

figure k
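As a check on the x2 example in this subsection, the sensitivity of rule 1 (correctly classified positive patterns divided by the 268 positive patterns) can be worked out with and without the range (x2 >= 0.791457). The subscripts and the rounding to three decimals are ours:

$$ {\text{Sensitivity}}_{\text{with}} = \frac{81}{268} \approx 0.302, \qquad {\text{Sensitivity}}_{\text{without}} = \frac{151}{268} \approx 0.563 $$

Dropping the range therefore raises the share of positive patterns recognised by rule 1 from roughly 30% to roughly 56%.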
4.2.2.3 Rule merging

The minimized rule-set obtained during the rule pruning phase reveals two attributes with different data ranges, x2 and x6. As a result, all such ranges are identified and pruned one by one during the rule merging procedure. If removing a data range from the rule-set increases the number of correctly classified positive patterns, that range is permanently removed. Following this, all of the different rules are combined into a single rule, as shown below:

figure l

4.3 Major factor identification

The previous step resulted in a merged rule with two attributes, x2 and x6. Therefore, during the major factor identification phase, it is determined which of the attribute(s) is/are most important for diagnosing the disease. To test the misclassification rates, the ranges of each attribute are reversed individually. The greater the misclassification rate, the greater the importance of the respective attribute. Table 1 shows the misclassification rates for single attributes.

Table 1 Misclassification rates when condition reversed (one attribute)

Table 1 shows that attribute x6 is the most promising cause, with a high misclassification rate when its range is reversed. Therefore, if a single major factor responsible for diabetes is to be selected, x6 is the one chosen. The same procedure is then followed incrementally to select more than one major factor. The misclassification rates with two attributes are shown in Table 2.

Table 2 Misclassification rates when condition reversed (two attributes)

The results presented above show that reversing the data ranges of more attributes increases the misclassification rate. Therefore, it can be said that the prevention rate will also increase as the number of controlled factors increases. Although the prevention rate increases with the number of factors, it may be difficult to keep several of the factors responsible for diabetes within their ranges. However, the analysis makes clear that the disease can be managed effectively even if the range of only one significant attribute is controlled.

5 Conclusion

The proposed TDMSML model is a method for determining the significant factor(s) and their range(s) for managing diabetes. If the range(s) of only the significant factor(s)/attribute(s) can be controlled, the occurrence of diabetes can be managed to a great extent and individuals can be spared various health issues. The proposed model generates transparent production rules from diabetes data sets using the decision tree algorithm. To identify the major factor(s) from these rules, the transparent rules are selected and pruned to remove redundancies, and the significant factor(s)/attribute(s) with range(s) are finally identified using the proposed transparent rule selection and rule pruning algorithms. According to the experimental results, diabetes can be significantly managed by controlling one or two attributes. Consequently, it can also be concluded that the TDMSML can assist physicians in analyzing medical records and in effectively preventing diabetes. The proposed model is also expected to bring about a significant change in the medical field for diabetes prevention. The proposed model generates production rules using a decision tree; however, other machine learning algorithms, such as neural networks, can also be used. The rule and attribute pruning algorithms can also be improved for greater efficiency.