1 Introduction

Data mining is the process of discovering knowledge from large volumes of data by applying data analysis and discovery algorithms [1; 2]. Classification, a major data mining task, is a supervised learning method in which an example set, called the training set, is used to assign a given query data item to one of a set of predefined classes. A classifier derived from the training set, such as a decision tree, a neural network, a support vector machine or any other classifier, may be used for this purpose [3].

The No Free Lunch theorem [3] states that there is no single best classifier suitable for all problems. In the rare cases where the probability structure of the problem is fully known, the Bayes classifier gives the minimum error and hence is the best one. However, one often has to work with a limited training set, even to estimate probabilities such as class conditional densities. So the Bayes classifier that uses probabilities estimated from a limited training set is not the best one. Some recent classifiers such as support vector machines [4; 5] are shown to have good generalization ability, especially on problems with limited training sets. But these methods, and many others such as artificial neural networks, have an important limitation, namely lack of interpretability, along with an inability to work with categorical or non-numeric attributes. That is, they cannot give reasons (in human understandable terms) for the decisions they make. Instance based classifiers such as the nearest neighbor classifier [6] or case based methods [7] show good performance, have good interpretability and can deal with non-numeric attributes, provided a proper distance measure or rules are available. However, the classification time these classifiers take is often proportional to the training set (or rule set) size, and hence they are slow. Decision tree classifiers, in which the classification rules are arranged in the form of a tree, show good performance, can work with non-numeric attributes and also give good interpretations. Because of the tree structure, they are faster than other rule-based methods [8; 9]. Hence, decision tree classifiers are popular in data mining applications.

In the decision tree induction process, for a node that corresponds to a continuous valued attribute, the test is done in two standard ways, viz., by splitting the attribute range into two intervals using a cut point [10], or into many intervals using several cut points [11]. That is, a partition of the range is found, which can be called a hard discretization of the range of values. Conventional decision trees built over continuous valued attributes, with crisp threshold based testing at a node, do not perform well, especially in the presence of noise [12; 13]. Two objects that are very near in the feature space can be classified into different classes. Hence, the crisp threshold (cut point) is not always desirable [14], and it may produce high misclassification rates since it has poor generalization ability. This is one of the reasons why decision trees are unstable, i.e., a small perturbation of the training set can result in a drastically different decision tree, and why decision trees have a high variance component in their error [15]. To overcome this problem, some techniques use probability based approaches [16–18] and some use possibility based methods [19–21]. One approach to this problem is to use fuzzy decision trees, where standard decision trees are combined with fuzzy rules.

An improvement over the above is to derive a soft partition of the range, called soft discretization [21; 22], using fuzzy set theory. Fuzzy decision trees have been used for many applications such as diagnosis systems [23], video mining [24], landslide susceptibility detection [25] and many other classification problems. Moreover, fuzzy decision trees have better interpretability than standard decision trees. In classifying a test pattern, a standard decision tree uses a single rule, the one corresponding to the path from the root to a leaf; all other rules are not satisfied. However, several rules might be satisfied with varying degrees, and these can be combined at a later stage to get the class label. This approach can reduce the variance component of the error [15]. A fuzzy decision tree induction method for regression problems was introduced in [15], where growing and pruning are combined to determine the structure of the soft decision tree, with techniques called refitting and backfitting. The effect of using various information measures in building fuzzy decision trees is studied in [19].

An ensemble of classifiers, rather than a single one, is also known to reduce the variance component of the error [26–28]. The most popular techniques for constructing ensembles are (i) bootstrap aggregation or bagging [29; 30], (ii) the Adaboost family of algorithms [31] (called boosting) and (iii) random forests [32; 33]. Randomness has been used in the induction process of decision trees and is shown to result in better classifiers [34]. Random forests perform significantly better than bagging and boosting [35]; they have smaller error rates and are also more robust to noise than the other methods [35]. In [36], a completely random decision tree algorithm is proposed, where at each node the splitting attribute is chosen randomly without using any heuristic measure such as information gain, gain ratio or Gini index; that is, the attribute selection at each node is completely stochastic. The authors conclude from experimental results that their model achieves significantly higher accuracy than a single hypothesis. In [37], the authors justify the higher accuracy of randomized decision tree approaches using posterior probability estimates; they also show that randomized decision trees reduce both the bias and the variance components of the mean squared error (MSE). In another method [38], subsets of features are randomly generated from the original feature set, a conventional decision tree is induced on each subset, and the resulting computations are shown to be better. In Dietterich's method [39], the splitting attribute is randomly chosen among the top k attributes with the highest information gain. In recent works, ensemble methods have been combined with fuzzy set based learning algorithms [40–43]. Another recent ensemble based work is the fuzzy random forest, where randomization is used to select a splitting attribute from a set of candidate attributes at every node [44]. A comparative study of combination methods for fuzzy vs. non-fuzzy classifiers concluded that better results are obtained with fuzzy based classifiers [45]. A theoretical analysis of how linear combiners of classifiers perform, depending on the performance of the individual classifiers, is given in [46].

This paper extends the ideas presented in [10] and [21], where a soft partition of a predefined width (informally, the width measures the degree of overlap between blocks of the soft partition) is obtained for continuous valued attributes. Since there is inherent vagueness in choosing the attribute, its cut point, its width, etc., the paper proposes to build an ensemble of soft decision trees in which these parameters (the attribute, the cut point, etc.) are drawn from probability distributions obtained from the training set. Because of the randomness injected at various levels, the component classifiers are expected to be sufficiently independent of each other, so that their consensus decision is better than that of any single tree.

The rest of the paper is organized as follows. Section 2 discusses conventional decision trees along with their shortcomings. Section 3 first gives a few definitions needed to describe fuzzy systems and then presents an induction method for the basic soft decision tree. Section 4 describes the core of the paper, i.e., a randomized soft decision tree induction method, followed by a discussion of an ensemble of randomized soft decision trees. Section 5 describes the data sets used. Experimental studies are given in Section 6, and Section 7 concludes the paper.

2 Standard decision tree induction

Let the data set be \(\mathcal {X} = \{ (X_{i}, y_{i}) \mid i=1,2, {\ldots}, n \}\), where \(X_{i}\) is a d-dimensional feature vector and \(y_{i}\) is the corresponding class label. Let the set of attributes (features) be \(\mathcal {A} = \{A_{1}, A_{2}, \ldots , A_{j}, \ldots , A_{d}\}\), all of which are assumed to be continuous valued.

To insert a node in the decision tree building process, the goodness of the attribute to be tested at the node is normally measured by the drop in some impurity measure, such as entropy, Gini index or gain ratio [1]. For continuous valued attributes, the entropy-based measure called information gain is usually used; it is described below. Let the data set be \(\mathcal {X}\); its entropy is given by

$$ Entropy(\mathcal{X})= {\sum\limits_{i=1}^{c}} -p_{i}\log_{2}(p_{i}), $$
(1)

where \(c\) is the number of class labels (in our experiments \(c = 2\)) and \(p_{i}\) is the proportion of data instances that belong to class \(i\).

Let \(a_{i_{1}},a_{i_{2}},\ldots ,a_{i_{m}}\) be the ordered sequence of distinct values of an attribute \(A_{i}\), \(i \in \{1,2,\ldots,d\}\), as given in the training set. Then, for every pair of successive values \((a_{i_{k}}, a_{i_{k+1}})\), a possible threshold \(\tau _{k}=(a_{i_{k}} + a_{i_{k+1}})/2\), for \(k = 1\) to \(m-1\), is examined as a cut point to partition \(\mathcal {X}\) into two blocks \(\mathcal {X}_{1}^{(i)}\) and \(\mathcal {X}_{2}^{(i)}\). The weighted average entropy of the resulting class distributions after this test is

$$ \textit{Info}(A_{i}, \tau_{k}) = \frac{n_{1}}{n} Entropy(\mathcal{X}_{1}^{(i)})+\frac{n_{2}}{n} Entropy(\mathcal{X}_{2}^{(i)}), $$
(2)

where \(n_{1}\) and \(n_{2}\) are the numbers of tuples in blocks \(\mathcal {X}_{1}^{(i)}\) and \(\mathcal {X}_{2}^{(i)}\), respectively. The information gain \(G\) of this test is

$$ G(A_{i}, \tau_{k})=Entropy(\mathcal{X})-\textit{Info}(A_{i}, \tau_{k}). $$
(3)

The attribute and cut point that give the maximum information gain are chosen for the test at the node. The process is repeated recursively to build the child nodes [1].
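To make the above procedure concrete, the following Python sketch (our own illustration, not the authors' implementation; the function names are ours) evaluates candidate midpoint thresholds for one continuous attribute and returns the best cut point according to Eqs. (1)–(3):

```python
# Illustrative sketch: entropy-based evaluation of candidate cut points.
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, as in Eq. (1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, tau):
    """Information gain of splitting the attribute at threshold tau, Eqs. (2)-(3)."""
    left = [y for v, y in zip(values, labels) if v <= tau]
    right = [y for v, y in zip(values, labels) if v > tau]
    n = len(labels)
    info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - info

def best_cut_point(values, labels):
    """Examine midpoints of successive distinct values; return (tau, gain)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(((t, information_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])
```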

2.1 Problems with standard decision trees

Some limitations of standard decision trees are described below.

To induce a decision tree for continuous valued data using the above method, each attribute needs \(n-1\) evaluations (where \(n\) is the number of instances, assuming the attribute has \(n\) distinct values) to determine the optimal cut point at which the information gain is maximum. This is therefore relatively expensive for large data sets. Here, we follow Fayyad's idea [10], which shows that only class boundary points need to be examined to determine the optimal cut point, i.e., the one that maximizes the information gain of the attribute. A cut point is called a boundary cut point if it falls between a pair of successive instances that belong to two different classes. From here on, a cut point means one that lies at a class boundary. Although in practice this significantly reduces the number of evaluations, the worst case time complexity is still O(n).
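A minimal sketch of this boundary-point search, under the assumption that instances are simply sorted by attribute value (the helper name is ours, not from the paper):

```python
def boundary_cut_points(values, labels):
    """Candidate thresholds that lie at class boundaries (following Fayyad [10])."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])  # sort by attribute value
    cuts = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:          # class changes across this boundary
            cuts.append((v1 + v2) / 2.0)   # midpoint is a candidate cut point
    return cuts
```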

Another limitation is illustrated with an example. Table 1 shows a toy data set for a two-class problem with two attributes, height and weight, and two class labels, tiger and cat. Let the crisp rule induced from the toy data (by applying the method described in Section 2) be as follows:

$$\text{IF}\; height > 23.35~~ \text{THEN} \;class-label = tiger. $$
Table 1 A toy dataset.

This rule classifies noise-free instances well. However, noisy instances are likely to be present in the training data, e.g., due to errors while reading feature values such as height, and this can decrease the performance of the hard decision tree. Hence, crisp boundary (cut point) based decision rules do not resolve uncertainties in the data. To overcome this, a soft decision tree based on fuzzy set theory is used, as explained in the next section.

3 Soft decision tree

This section explains the basic concepts of fuzzy set theory and then their application to the standard decision tree, i.e., the design of a soft decision tree.

3.1 Fuzzy set theory

Crisp sets, fuzzy sets, crisp partitions and fuzzy partitions are defined below. Details can be found in [47]. Let O be the collection of objects in the universe of discourse.

Crisp set: A crisp set C is a subset of the universe of discourse O, characterized by a sharp characteristic (membership) function \(\mu_{C} : O \to \{0, 1\}\). For each object \(o \in O\),

$$\mu_{C}(o) = \left\{ \begin{array}{ll} 1 & \text{if} ~~ o\in C, \\ 0 & \text{otherwise.} \end{array} \right. $$

Fuzzy set: A fuzzy set F is a subset of the universe of discourse O, whose characteristic (membership) function is \(\mu_{F} : O \to [0, 1]\). For each object \(o \in O\),

$$\mu_{F}(o) = \left\{ \begin{array}{ll} 1 & \text{if} ~~ o\in F, \\ x & 0<x<1,~ o~ \text{is in}~ F~ \text{with degree (membership value)}~ x,\\ 0 & \text{otherwise, i.e.,}~ o \not\in F. \end{array} \right. $$

Partition: A hard or crisp partition of the set O is a non-empty set of blocks \(\{B_{1}, B_{2}, \ldots, B_{p}\}\), where each block is a non-empty crisp set,

$$\begin{array}{l} B_{i} \subseteq O~~ \text{for}~1\le i\le p, \\ B_{i}\cap B_{j} = \phi ~\text{and} \\ \bigcup_{i=1}^{p} B_{i} = O. \end{array} $$

In other words, it is a mutually disjoint and collectively exhaustive collection of blocks.

Fuzzy partition: A fuzzy or soft partition of O is a non-empty set of fuzzy sets \(\{F_{1}, F_{2}, \ldots, F_{p}\}\), where each fuzzy set, also known as a fuzzy block, is non-empty (i.e., there is an element whose membership value is nonzero), and for each \(o \in O\),

$${\sum\limits_{i=1}^{p}} \mu_{F_{i}}(o) = 1. $$

Fuzzy blocks of a fuzzy partition can be given human-understandable names, called linguistic variables. For example, height may be divided into two fuzzy blocks called “short” and “tall”. Then a rule can be stated as

$$\text{IF}\; {height}\; \text{is ``tall'' THEN}\; {class-label} = \text{tiger}. $$

The degree to which the height is in the fuzzy block “tall” determines the strength of this rule.

This rule is easier to understand, and hence has better interpretability, than the rule:

$$\text{IF}\; {height} > 23.35\; \text{THEN}\; {class-label} = \text{tiger}. $$

Figure 1 illustrates this example. With the soft partition, if the given height is 23.4, there is scope for the final class label to be either cat or tiger, whereas with the hard partition one of these is ruled out.

Figure 1. Hard partition vs. soft partition.

3.2 Design of soft decision tree

The building process of a soft decision tree is similar to that of the standard decision tree (discussed in Section 2) except for the way the goodness measure is calculated, which uses fuzzy membership values.

Let \(B\) be a continuous valued attribute and let \(\tau\) be the optimal threshold or cut point chosen among all possible thresholds (as discussed in Section 2.1). In the case of a crisp partition, if the value of attribute \(B\) is less than or equal to the cut point (threshold) \(\tau\) then it belongs to block \(B_{1}\), else it belongs to block \(B_{2}\), as shown in figure 1(a). In the case of a soft partition, if the value of attribute \(B\) is less than or equal to \(\tau - w/2\) then it belongs to the fuzzy block \(B_{1}\), denoted by the linguistic variable “short”; if its value is greater than or equal to \(\tau + w/2\) then it belongs to the fuzzy block \(B_{2}\), denoted by the linguistic variable “tall”; otherwise it belongs to both fuzzy blocks “short” and “tall” with some fuzzy membership values, as shown in figure 1(b), where \(w\) is the overlapping width of the two fuzzy blocks “short” and “tall”.

The fuzzy membership values are calculated as follows. Let \(\mu _{B_{j}}(x_{i})\) denote the fuzzy membership of instance \(x_{i}\) in fuzzy block \(B_{j}\). The membership value \(\mu_{short}(x_{i})\) of an instance \(x_{i}\) in the fuzzy block “short” (shown in figure 1(b)) is calculated as given below.

$$\mu_{short}(x_{i}) =\left\{ \begin{array}{ll} 1 & \text{if}\; x_{i} \leq \tau-w/2, \\ \frac{ (\tau+\frac{w}{2}) - x_{i} }{w} & \text{if}\; \tau-\frac{w}{2} < x_{i} < \tau+\frac{w}{2}, \\ 0 & \text{if}\; x_{i} \geq \tau+w/2. \end{array} \right. $$

Similarly, the membership value \(\mu_{tall}(x_{i})\) of an instance \(x_{i}\) in the fuzzy block “tall” (shown in figure 1(b)) is calculated as given below.

$$\mu_{tall}(x_{i}) =\left\{ \begin{array}{ll} 0 & \text{ if}\; x_{i} \leq \tau-w/2, \\ \frac{ x_{i}-(\tau-\frac{w}{2}) }{w} & \text{ if}\; \tau-\frac{w}{2} < x_{i} < \tau+\frac{w}{2}, \\ 1 & \text{if}\; x_{i} \geq \tau+w/2. \end{array} \right. $$
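These two membership functions can be sketched in Python as follows (the names are ours; the example values assume the toy cut point of Section 2.1 and an illustrative width \(w = 2\)):

```python
# Trapezoidal memberships around cut point tau with overlap width w.
def mu_short(x, tau, w):
    """Membership of value x in the fuzzy block 'short'."""
    if x <= tau - w / 2:
        return 1.0
    if x >= tau + w / 2:
        return 0.0
    return ((tau + w / 2) - x) / w

def mu_tall(x, tau, w):
    """Membership in 'tall'; the two memberships of the soft partition sum to 1."""
    return 1.0 - mu_short(x, tau, w)

# Example: with tau = 23.35 and an illustrative width w = 2, a height of 23.4
# belongs to both blocks: mu_short = 0.475 and mu_tall = 0.525.
```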

In the process of designing a soft decision tree, the fuzzy entropy of a training set \(\mathcal {X}\) is calculated as follows:

$$ {Entropy}_{F}(\mathcal{X})= \sum\limits_{k=1}^{c} - \mu_{p_{k}}(\mathcal{X})\log_{2}\mu_{p_{k}}(\mathcal{X}), $$
(4)

where \(\mu_{p_{k}}(\mathcal {X})={\sum }_{x_{i} \in c_{k}} (\mu _{B_{1}}(x_{i})+\mu _{B_{2}}(x_{i}))\) is the fuzzy proportion of instances in the fuzzy set \(\mathcal {X}\) that belong to class \(c_{k}\). Next, the weighted average of the fuzzy entropy (the fuzzy info) for the fuzzy partition of the tuples in \(\mathcal {X}\) over the values of an attribute \(B\) with the optimal threshold \(\tau_{k}\) is given below.

$$ {Info}_{F}(B,\tau_{k},\mathcal{X})=\frac{N_{F}^{\mathcal{X}_{1}}}{N_{F}^{\mathcal{X}}}\textit{Entropy}_{F}(\mathcal{X}_{1})+\frac{N_{F}^{\mathcal{X}_{2}}}{N_{F}^{\mathcal{X}}}{Entropy}_{F}(\mathcal{X}_{2}), $$
(5)

where \({Entropy}_{F}(\mathcal {X}_{1})\) and \({Entropy}_{F}(\mathcal {X}_{2})\) are the fuzzy entropies of the subsets \(\mathcal {X}_{1}\) and \(\mathcal {X}_{2}\), respectively, \(N_{F}^{\mathcal {X}}={\sum }_{i=1}^{|\mathcal {X}|}(\mu _{B_{1}}(b_{i})+\mu _{B_{2}}(b_{i}))\), \(N_{F}^{\mathcal {X}_{1}}={\sum }_{i=1}^{|\mathcal {X}|}\mu _{B_{1}}(b_{i})\) and \(N_{F}^{\mathcal {X}_{2}}={\sum }_{i=1}^{|\mathcal {X}|}\mu _{B_{2}}(b_{i})\). Similar to the standard decision tree, the fuzzy information gain of an attribute \(B\) with threshold \(\tau_{k}\) is calculated as given below.

$$ G_{F}(B, \tau_{k})=\textit{Entropy}_{F}(\mathcal{X})-\textit{Info}_{F}(B,\tau_{k},\mathcal{X}). $$
(6)

Similarly, we find the fuzzy information gain of every attribute in \(\mathcal {X}\) and choose the attribute with the highest fuzzy information gain as the splitting attribute at the root node. The above process is repeated recursively on the child nodes of the root to grow the soft decision tree.
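A hedged sketch of this fuzzy goodness measure is given below; `mu1` and `mu2` are the membership arrays of the instances in the two fuzzy blocks of a candidate attribute, and normalizing the fuzzy class counts into proportions is our reading of Eq. (4):

```python
# Illustrative sketch of the fuzzy goodness measure of Eqs. (4)-(6).
import math

def fuzzy_entropy(mu, labels, classes):
    """Fuzzy entropy; mu[i] is the membership (weight) of instance i."""
    total = sum(mu)
    ent = 0.0
    for c in classes:
        p = (sum(m for m, y in zip(mu, labels) if y == c) / total) if total else 0.0
        if p > 0.0:
            ent -= p * math.log2(p)
    return ent

def fuzzy_information_gain(mu1, mu2, labels, classes):
    """G_F = Entropy_F(X) - Info_F(B, tau, X), weighted as in Eq. (5)."""
    n1, n2 = sum(mu1), sum(mu2)          # N_F of the two fuzzy blocks
    n = n1 + n2                          # N_F of the whole set
    parent = fuzzy_entropy([a + b for a, b in zip(mu1, mu2)], labels, classes)
    info = (n1 / n) * fuzzy_entropy(mu1, labels, classes) \
         + (n2 / n) * fuzzy_entropy(mu2, labels, classes)
    return parent - info
```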

Figure 2 shows a soft decision tree model; it has two test nodes, with attributes \(A\) and \(B\), and three terminal nodes. Inference in a soft decision tree proceeds as follows. To classify an unseen instance, the fuzzy membership values of the instance at each node from the root to a leaf are calculated. Let \(x_{i}\) be the instance to be classified into one of the class labels \(C_{1}\) and \(C_{2}\); then the fuzzy membership value of \(x_{i}\) for each class is calculated as given below.

  • The fuzzy membership value \(\pi_{1}\) of the instance \(x_{i}\) for class \(C_{1}\) is \(\mu _{A_{1}}(x_{i})\otimes \mu _{B_{1}}(x_{i})\), where ⊗ is the fuzzy product operation.

  • The fuzzy membership value \(\pi_{2}\) of the instance \(x_{i}\) for class \(C_{2}\) is either \(\mu _{A_{1}}(x_{i})\otimes \mu _{B_{2}}(x_{i})\) or \(\mu _{A_{2}}(x_{i})\).

Figure 2. Soft decision tree.

The fuzzy product operation of \(\mu _{A_{1}}(x_{i})\) and \(\mu _{B_{1}}(x_{i})\) denoted \(\mu _{A_{1}}(x_{i})\otimes \mu _{B_{1}}(x_{i})\), is \(min(\mu _{A_{1}}(x_{i}), \mu _{B_{1}}(x_{i}) )\).

If the instance \(x_{i}\) belongs to the two classes \(C_{1}\) and \(C_{2}\) with fuzzy membership values \(\pi_{1}\) and \(\pi_{2}\), respectively, then class \(C_{1}\) is assigned to \(x_{i}\) if \(\pi_{1} > \pi_{2}\); otherwise class \(C_{2}\) is assigned.
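A minimal sketch of this inference for the tree of figure 2 is shown below. The membership functions are assumed to be given (e.g., as sketched above), and since the text leaves open how the two rules ending in class \(C_{2}\) are aggregated, taking their maximum is our assumption:

```python
# Sketch of inference in the two-test-node tree of figure 2; mu_A1, mu_A2,
# mu_B1, mu_B2 return the memberships of the instance's attribute values.
def classify(x, mu_A1, mu_A2, mu_B1, mu_B2):
    pi_1 = min(mu_A1(x), mu_B1(x))                 # rule: A1 and B1 -> C1
    pi_2 = max(min(mu_A1(x), mu_B2(x)), mu_A2(x))  # rules ending in C2 (max is our choice)
    return "C1" if pi_1 > pi_2 else "C2"
```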

4 A randomized soft decision tree

In this section, the proposed method, a randomized soft decision tree classification model, is described, and an ensemble of randomized soft decision trees is presented.

In this method, several cut points are considered for each attribute in the induction process of a decision tree, whereas in the existing method a single cut point is used [21]. In the induction process, at a node, the attribute and its cut point are randomly chosen from distributions as described below.

Let \(G_{F}^{max}(A_{i})\) be the maximum fuzzy information gain of attribute \(A_{i}\) over its various cut points. The probability distribution over the set of attributes, from which an attribute is randomly chosen, is defined as follows.

$$ P(A_{i}) = \frac{G_{F}^{max}(A_{i})}{{\sum}_{k=1}^{d} G_{F}^{max}(A_{k})}. $$
(7)

Among the various cut points of the chosen attribute \(A_{i}\), the cut point is randomly selected from the distribution in which the probability of choosing the cut point corresponding to threshold \(\tau_{k}\) is

$$ P(\tau_{k}) = \frac{G_{F}(A_{i},\tau_{k})}{{\sum}_{\forall l} G_{F}(A_{i},\tau_{l})}. $$
(8)
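The two sampling steps of Eqs. (7) and (8) can be sketched as follows (illustrative only; `gains` is an assumed mapping from each attribute to the fuzzy information gain of each of its candidate thresholds):

```python
# Sample the splitting attribute and its cut point in proportion to the
# fuzzy information gain, rather than taking the argmax.
import random

def sample_proportional(weights):
    """Draw a key of `weights` with probability weight / sum of weights."""
    keys = list(weights)
    return random.choices(keys, weights=[weights[k] for k in keys], k=1)[0]

def choose_split(gains):
    """gains: attribute -> {threshold: G_F(attribute, threshold)}."""
    attr_weights = {a: max(taus.values()) for a, taus in gains.items()}  # G_F^max(A_i)
    attribute = sample_proportional(attr_weights)   # Eq. (7)
    tau = sample_proportional(gains[attribute])     # Eq. (8)
    return attribute, tau
```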

To avoid over-fitting, pruning is done, which could be either prepruning or postpruning [1]. The paper uses prepruning, where the building process is terminated as soon as the resulting error falls below a prespecified threshold. The error threshold is chosen by three-fold cross validation from {0.1, 0.2, 0.3, 0.4, 0.5}.
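A sketch of how such a threshold could be selected by three-fold cross validation is given below; `build_tree` and `error_rate` are assumed interfaces, not part of the paper:

```python
# Assumed interfaces: build_tree(train, error_threshold) grows a randomized
# soft decision tree with prepruning; error_rate(tree, data) returns its error.
def select_error_threshold(data, build_tree, error_rate, folds=3,
                           candidates=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Pick the prepruning threshold with the lowest cross-validated error."""
    fold_size = len(data) // folds
    best, best_err = None, float("inf")
    for t in candidates:
        fold_errors = []
        for f in range(folds):
            val = data[f * fold_size:(f + 1) * fold_size]
            train = data[:f * fold_size] + data[(f + 1) * fold_size:]
            fold_errors.append(error_rate(build_tree(train, error_threshold=t), val))
        avg = sum(fold_errors) / folds
        if avg < best_err:
            best, best_err = t, avg
    return best
```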

4.1 An ensemble of randomized soft decision trees

There are two frameworks for building an ensemble of classifiers: a dependent framework and an independent framework. In the dependent framework, the classifiers are dependent, so that the output of one classifier is used in the design of the next one (Adaboost is a good example). Alternatively, the parameters used to build each classifier differ from those of the other components, so that each classifier is independent of the others (bagging is a good example) [48].

Let \(\mathcal {X}\) be the given training set with \(n\) tuples. The bootstrap method is applied to derive each sub-training set \(\mathcal {X}_{i}\) (drawn by sampling with replacement). Let \(T_{1}, T_{2}, \ldots, T_{l}\) be the randomized soft decision trees of the ensemble, each learned on a bootstrapped training set as discussed in Section 4. The parameter \(l\) is chosen by three-fold cross validation from {1, 3, 5, 7, 9, 11, 13, 15}. For a given query pattern \(Q\), the proposed ensemble model takes the output of each randomized soft decision tree \(T_{i}\), for \(i = 1, 2, \ldots, l\), and assigns to \(Q\) the class label (either 1 or 0) obtained by majority voting.
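A minimal sketch of this independent ensemble (bootstrap sampling plus majority voting) is given below; `build_randomized_soft_tree` and `tree.predict` are assumed interfaces:

```python
# Grow l randomized soft decision trees on bootstrap samples and combine
# their outputs by majority voting.
import random
from collections import Counter

def build_ensemble(data, l, build_randomized_soft_tree):
    trees = []
    for _ in range(l):
        bootstrap = [random.choice(data) for _ in range(len(data))]  # with replacement
        trees.append(build_randomized_soft_tree(bootstrap))
    return trees

def ensemble_predict(trees, query):
    votes = Counter(tree.predict(query) for tree in trees)
    return votes.most_common(1)[0][0]   # majority-voted class label
```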

5 Data sets

Three data sets are used in the experimental study.

  1. The 1999 KDD Cup data set, originated by MIT Lincoln Laboratory [49]. Since 1999, the KDD Cup data set has been widely used for the evaluation of intrusion detection systems. It was prepared from the data captured in the 1998 DARPA IDS evaluation program [50]. The 1999 KDD Cup training data set was derived from around 4,900,000 connections; each connection is represented as a vector of 41 features and is labeled as either normal or anomalous. Among the 41 features, we use only the continuous attributes in our experiments; a detailed explanation of each continuous feature is given in [51].

  2. The spam mail data set, collected at Hewlett-Packard Labs. It has 4601 instances with 57 continuous attributes and a nominal class label that categorizes each e-mail as spam or not. Its documentation and data are available at the UCI Machine Learning Repository [52].

  3. The Pima Indians Diabetes data set, originated by the National Institute of Diabetes and Digestive and Kidney Diseases. This data set has 768 instances, each with eight attributes, all of them continuous valued; details can be found at the UCI Machine Learning Repository [52].

6 Experimental results and discussion

In this section, we discuss the results of the proposed model, an ensemble of randomized soft decision trees for robust classification. The performance of the proposed model is compared against various existing models in terms of accuracy and standard deviation. We used Quinlan's C4.5 package for the standard decision tree [53]. To evaluate the various methods used in this paper, we injected noise at levels ranging from 1% to 6% into the three specified data sets. Tables 2 and 3 compare the experimental results over these data sets before and after injecting noise, respectively. Figures 3, 4 and 5 show the results for the PIMA, SPAM MAIL and 1999 KDD CUP data sets, respectively. It is clear from the results that as the percentage of noise increases, the performance of the standard decision tree drops abruptly for the PIMA and SPAM MAIL data sets, whereas for the 1999 KDD CUP data set it increases to some extent and then decreases. For the existing soft decision tree model, the proposed randomized soft decision tree model and the ensemble of randomized soft decision trees, the performance increases first and then decreases slowly.

Figure 3. Experimental results over the PIMA dataset.

Figure 4. Experimental results over the SPAM MAIL dataset.

Figure 5. Experimental results over the 1999 KDD CUP dataset.

Table 2 Experimental results before injecting noise.
Table 3 Experimental results after injecting noise.

Experimental results show that the proposed randomized soft decision tree model and the ensemble of randomized soft decision trees perform better on the standard data sets and are also more robust to noise than the remaining methods.

6.1 Complexity of randomized soft decision tree

In this section, the complexity of the proposed method, the randomized soft decision tree, is discussed for a single test node with \(d\) attributes and \(n\) tuples. An exhaustive search for the optimal cut point requires \(n-1\) evaluations for each attribute, which becomes expensive as \(n\) increases.

The heuristic search of the proposed model needs to examine only class boundary cut points instead of all \(n-1\) cut points of each attribute. For a \(k\)-class problem, in the best case, where the sorted sequence of instances has all instances of the same class adjacent to each other, only \(k-1\) evaluations are needed to find the optimal cut point of an attribute. In the worst case, where the class changes from each instance to the next, \(n-1\) evaluations are needed per attribute.

6.2 CPU times

Table 4 gives the CPU times of the various methods discussed in this paper. These times were recorded on the Fedora platform with an Intel Core i3 processor at 2.40 GHz and 4 GB RAM. The observations from table 4 are as follows: the CPU time of the standard decision tree (SDT) increases on all three datasets as the size of the training set (TRSize) increases, whereas for the proposed randomized soft decision tree and the existing soft decision tree models the CPU time remains more or less the same as the training set size increases. For the proposed ensemble method, where each component is derived using the proposed randomized soft decision tree model, the CPU time increases with the size of the training set.

Table 4 Comparing CPU times.

Hence, the computational cost of the proposed randomized soft decision tree model is lower than that of the existing standard decision tree and more or less equal to that of the existing soft decision tree model.

7 Conclusion

In this paper, fuzzy set theory is combined with standard decision tree classification to build a randomized soft decision tree model, and an ensemble of randomized soft decision trees for robust classification is also presented. As an improvement, instead of deterministically maximizing information gain as the goodness measure, parameters such as the splitting attribute and its cut point are randomly chosen from probability distributions derived from the fuzzy information gain. Experimental results over three standard data sets show that the proposed ensemble method and the randomized soft decision tree outperform, and give more robust classification than, the related soft decision tree and the standard decision tree, especially in the presence of noise.