1 Introduction

Decision trees are among the most popular classification methods due to their efficiency, simplicity and interpretability. Algorithms for constructing decision trees such as C4.5 [1] create a single decision tree during the training phase and then use the tree to classify test instances. In contrast, the lazy decision tree (LazyDT) builds a customized tree for each test instance, which consists of only a single path from the root to a leaf node [2]. LazyDT has two strengths compared with eager decision trees. Firstly, the decision paths built by LazyDT are often shorter than the paths of eager decision trees. Secondly, LazyDT can overcome the data fragmentation problem [3]. However, LazyDT does not take imbalanced data into account, i.e., data in which one class (the majority class) vastly outnumbers the other (the minority class). Due to the nature of learning algorithms, learning from imbalanced data sets remains a challenging problem that arises widely in many domains including, but not limited to, fraud detection [4], network intrusion [5] and medical diagnosis [6]. In order to overcome the class imbalance problem, many methods have been proposed; they can be roughly divided into three categories as follows [7,8,9]. (1) Data-level methods concentrate on modifying the training set until all classes are approximately equally represented. The most common techniques are undersampling of the majority class and oversampling of the minority classes. Since undersampling of the majority class does not take full advantage of the useful information in the majority class, oversampling has received more attention. A very popular approach is SMOTE (Synthetic Minority Oversampling TEchnique), which increases diversity by generating synthetic minority class data [10]. Depending on how the synthetic samples are generated, several variants of SMOTE have been proposed, such as Borderline-SMOTE [11], the Adaptive Synthetic Sampling Technique (ADASYN) [12], MSMOTE [13] and MWMOTE [14]. In addition, when a dataset is high-dimensional and highly class-imbalanced, existing online feature selection algorithms usually ignore the small classes, which can be important in these applications; selecting features from such data in an online manner is therefore a challenge. Online Feature Selection based on the Dependency in K nearest neighbors (K-OFSD) uses the information of nearest neighbors to select relevant features [15]. (2) Algorithm-level methods directly modify existing algorithms to alleviate the bias towards the majority class. The most commonly modified methods are decision trees and support vector machines (SVMs). For decision trees, the main approach is to use skew-insensitive split criteria to generate the tree, which is discussed in detail in the next paragraph. For SVMs, one approach is to adjust the class boundary towards the majority class based on the kernel-alignment idea [16]; the other is to modify the twin support vector machine (TSVM), as in the maximum margin of twin spheres support vector machine (MMTSSVM) and the maximum margin of twin spheres support vector machine with pinball loss (Pin-MMTSSVM). MMTSSVM finds two homocentric spheres by solving a quadratic programming problem and a linear programming problem: the small sphere contains as many positive samples as possible, while most negative samples are pushed outside the large sphere [17].
In contrast, Pin-MMTSSVM replaces the hinge loss used in MMTSSVM with the pinball loss to improve its generalization performance [18]. (3) Hybrid methods combine the advantages of the two previous groups. One strategy is to resample the dataset and then use standard classifiers such as decision trees, Naive Bayes and SVMs; the other is to combine pre-processing with bagging and boosting. SMOTEBagging [19], SMOTEBoost [20], EasyEnsemble [21] and BalanceCascade [21] are examples of this type of approach. Compared with the other two groups, hybrid methods offer more advantages and broader room for development [22]. Moreover, multi-class imbalance classification is widely applied in many areas. In recent years, a new ensemble learning method has been proposed to tackle multi-class imbalance classification; it uses the OVO (one-vs-one) decomposition scheme to divide an m-class imbalance classification problem into m(m − 1)/2 binary subproblems and utilizes a confidence degree matrix to perform the aggregation [23].

In terms of decision trees, the split criteria commonly used in decision trees, such as the Gini index, information gain and DKM, have been shown to be skew-sensitive, and the Hellinger distance has been proposed as a split criterion to build Hellinger distance decision trees (HDDT) [24, 25]. Meanwhile, variants of HDDT have been proposed to deal with different problems. For example, the Multi-Class HDDT (MHDDT) employs techniques similar to Error Correcting Output Codes (ECOC) decomposition to deal with multi-class imbalance classification [26], and the Gaussian Hellinger Very Fast Decision Tree (GH-VFDT) computes the Hellinger distance between two normal distributions P and N directly, which allows it to deal with imbalanced data streams [27]. LazyDT, in contrast, adopts information gain to build the decision path, which impedes its ability to learn the minority class concept. For these reasons, we use the Hellinger distance and the K-L divergence as split criteria to build LazyDT, yielding the Hellinger distance based lazy decision tree (HLazyDT) and the K-L divergence based lazy decision tree (KLLazyDT), respectively. An experimental framework is applied across a wide range of imbalanced data sets to investigate the effectiveness of our methods compared with other methods, including the lazy decision tree, C4.5, the Hellinger distance based decision tree and the support vector machine. The data sets used in this experiment are categorized into two groups, highly and lowly imbalanced data sets, according to their imbalance ratio. In addition, we use SMOTE [10] to preprocess the highly imbalanced data sets and evaluate its effectiveness, since it has become the de facto standard for improving the performance of decision tree algorithms [28]. The experimental results, which are contrasted through nonparametric statistical tests, demonstrate that using the Hellinger distance or the K-L divergence as the split criterion effectively improves the performance of LazyDT on imbalanced classification.

In summary, the key contributions of this paper are as follows. (1) We analyse why the split criterion used in LazyDT is skew-sensitive. (2) We use the Hellinger distance and the K-L divergence as split criteria in LazyDT and demonstrate that this modification effectively improves the performance of LazyDT on imbalanced classification. (3) We investigate the effectiveness of SMOTE preprocessing.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 provides the formulations and some related properties of the Hellinger distance and the K-L divergence. Section 4 describes the Hellinger distance based lazy decision tree (HLazyDT) and the K-L divergence based lazy decision tree (KLLazyDT). Section 5 describes the experimental framework and analyzes the experimental results. Finally, the conclusions of this work are presented in Section 6.

2 Related work

As with many lazy algorithms, there is essentially no training phase and all the work of LazyDT is done during the classification of a given test instance. LazyDT, which receives the test instance as part of its input, follows a separate-and-classify methodology: a feature is chosen and the sub-problem containing the instances with the same feature value as the given instance is then solved recursively. The overall effect is that of tracing a path in an imaginary tree built for the test instance. The heart of LazyDT is the information gain used as the split criterion to select the split feature. Unlike eager decision trees, LazyDT takes into account the information of not only the training instances but also the test instance. Essentially, LazyDT builds a single decision path for a given test instance, which avoids splits on features that are irrelevant for the specific instance. Therefore, LazyDT avoids the unnecessary data fragmentation and duplicate subtrees shown in Fig. 1 and may produce a shorter and more accurate decision path for the specific instance. On the other hand, three problems need to be solved in the framework of LazyDT. The first is how to handle missing values: LazyDT simply ignores the missing feature and never branches on a missing value in the test instance. The second problem is that the information gain may be negative or zero; in this case, LazyDT returns the most frequent class in the training set covered by the current node. The last problem is that there may be no training instances with the same feature value as the given instance; in this case, LazyDT backtracks to the parent node and returns the most frequent class in the training set. Finally, from an application point of view, LazyDT has also been applied to distributed privacy-preserving data mining in recent years [29].
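To make the separate-and-classify procedure concrete, the following minimal Python sketch traces a single decision path for one test instance using an entropy-based gain. It is only an illustration of the description above, not the exact published algorithm [2]: the dictionary-based instance representation is our own choice, and the missing-value handling discussed in the text is omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def lazy_classify(train, test_instance):
    """Trace one LazyDT decision path for a single test instance.

    train: list of (features: dict, label) pairs; test_instance: dict of
    feature name -> discrete value.  A minimal sketch only.
    """
    data = list(train)
    used = set()
    while True:
        labels = [y for _, y in data]
        if len(set(labels)) == 1:                    # pure node: classify
            return labels[0]
        parent_h = entropy(labels)
        best_feat, best_gain = None, 0.0
        for f, v in test_instance.items():
            if f in used:
                continue
            branch = [y for feats, y in data if feats.get(f) == v]
            if not branch:                           # value unseen in this node
                continue
            gain = parent_h - entropy(branch)        # gain along the single branch
            if gain > best_gain:
                best_feat, best_gain = f, gain
        if best_feat is None:                        # gain <= 0 for every feature
            return Counter(labels).most_common(1)[0][0]
        used.add(best_feat)
        data = [(feats, y) for feats, y in data
                if feats.get(best_feat) == test_instance[best_feat]]
```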

Fig. 1 Data fragmentation and duplicate subtrees in eager decision trees

Since boosting can effectively improve the performance of various decision trees [30, 31], Fern and Brodley proposed relevance-based boosting of lazy decision trees, which consists of two steps [32]. Firstly, it builds a customized LazyDT ensemble for each test instance. All training instances are equally weighted at the beginning. In each iteration, a decision path for the given test instance is produced by applying LazyDT to the weighted training set. The instance weights are then adjusted to form a new distribution for the next iteration according to how relevant each instance is to classifying the given test instance and whether its class label is predicted correctly by the current decision path. The relevance level is the depth of the node at which the training instance is discarded when producing the given decision path. Secondly, relevance-based boosting of lazy decision trees also adopts a distance-based pruning strategy to address the problem of over-fitting. In each iteration, the pruning strategy performs a greedy search to select a feature to delete based on the ratio of the heterogeneous distance to the homogeneous distance. The heterogeneous distance measures the distance between the test instance and the closest cluster of instances from the other classes, while the homogeneous distance is defined as the distance between the test instance and the most frequent class in the current set. Although an empirical comparison with boosted regular decision trees shows that relevance-based boosting of lazy decision trees achieves comparable accuracy and better comprehensibility, the distance-based pruning strategy is biased against the minority class. Thus, it is difficult for this approach to deal with imbalanced data sets.

Because LazyDT must be re-run independently for each test instance, its main drawback is the high computational cost of the classification phase. In order to reduce this cost, a batched lazy decision tree was introduced by Guillame-Bert and Dubrawski [33]. The batched lazy decision tree substitutes a subtree for a single path, so that each necessary node is visited only once in a single pass. The algorithm uses the value of the test instance on the split feature to filter the training set and the test set simultaneously in each iteration. Once the termination condition is satisfied, the filtered test instances are labeled with the most frequent class.

In summary, the algorithms mentioned above do not take imbalanced learning into account; all of them rely on information gain as the split criterion. It is well known that information gain is a measure of impurity. For imbalanced data sets, the class distribution is skewed, which means that the impurity of imbalanced data is lower than that of balanced data. When information gain is used as the split criterion to choose a feature on an imbalanced data set, the gain may be zero or even negative, which halts tree growth. As a result, test instances tend to be classified as the majority class, which degrades the performance of the classifier. This is why information gain is skew-sensitive.
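As a small numeric illustration of this argument (all counts below are hypothetical), the entropy of an imbalanced class distribution is already low, and the gain measured along the single branch matching the test instance's value can indeed be negative:

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a class-count vector."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Impurity of a balanced vs. an imbalanced class distribution.
print(entropy([50, 50]))                 # 1.000 bit
print(entropy([90, 10]))                 # ~0.469 bits: already quite "pure"

# LazyDT-style gain along the single branch that matches the test value:
# the branch may be less pure than its parent, so the gain is negative
# and tree growth stops, defaulting to the majority class.
parent = [90, 10]                        # node with 90 majority, 10 minority
branch = [8, 2]                          # instances sharing the test value
print(entropy(parent) - entropy(branch)) # ~ -0.253
```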

3 Hellinger distance and K-L divergence

3.1 Hellinger distance

The Hellinger distance is a measure that reflects the divergence between two probability distributions [34]. Given a countable space Φ and two distributions P and Q on it, the Hellinger distance is defined as follows.

$$ D_{H}(P,Q)= \sqrt{\sum\limits_{\phi \in {\Phi}}(\sqrt{P(\phi)} - \sqrt{Q(\phi)})^{2}}\quad. $$
(1)

The Hellinger distance has the following properties.

(1) \(D_{H}(P,Q)\) lies in \([0, \sqrt{2}]\).

(2) The Hellinger distance is symmetric and non-negative.

(3) The larger the Hellinger distance is, the better the two probability distributions are discriminated.

The Hellinger distance with respect to a feature for a two-class problem, which was first used as a split criterion in decision trees [24], is given as follows.

$$ D_{H}(X_{+},X_{-})=\sqrt{\sum\limits_{j = 1}^{P}(\sqrt{\frac{\vert X_{+j} \vert}{\vert X_{+} \vert}} - \sqrt{\frac{\vert X_{-j} \vert}{\vert X_{-}\vert}})^{2}}\quad, $$
(2)

where \(\vert X_{+} \vert\) indicates the number of instances belonging to the minority class in the training set and \(\vert X_{+j} \vert\) is the number of minority class instances with value j for the feature X; \(\vert X_{-} \vert\) and \(\vert X_{-j} \vert\) are defined analogously for the majority class. In addition, P is the number of different values of the feature X. Since (2) is not influenced by the prior probabilities, it is insensitive to the class distribution.
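A minimal sketch of the computation in (2), assuming the per-value counts of a discrete feature have already been gathered (the example counts are hypothetical):

```python
import math

def hellinger_feature(counts_pos, counts_neg):
    """Eq. (2): Hellinger distance of a discrete feature for a two-class problem.

    counts_pos[j] / counts_neg[j]: number of minority / majority class
    instances taking the j-th value of the feature.
    """
    n_pos, n_neg = sum(counts_pos), sum(counts_neg)
    return math.sqrt(sum(
        (math.sqrt(p / n_pos) - math.sqrt(q / n_neg)) ** 2
        for p, q in zip(counts_pos, counts_neg)))

# Hypothetical feature with three values: the distance depends on how
# differently the two classes spread over the values, not on the class priors.
print(hellinger_feature([8, 1, 1], [10, 45, 45]))   # ~0.77
```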

3.2 K-L divergence

The Kullback-Leibler divergence, generally abbreviated as K-L divergence, is an asymmetric measure that reflects the divergence between two probability distributions [35]. Suppose Φ is a countable space and P and Q are two probability distributions on Φ. \(D_{kl}(P,Q)={\sum }_{\phi \in {\Phi }} \left (\ln (\frac {P(\phi )}{Q(\phi )}) \cdot P(\phi )\right )\) denotes the K-L divergence of Q from P. Similarly, the K-L divergence of P from Q is defined as \(D_{kl}(Q,P)={\sum }_{\phi \in {\Phi }} \left (\ln (\frac {Q(\phi )}{P(\phi )}) \cdot Q(\phi )\right )\). In order to make the K-L divergence symmetric, a symmetrised form of it is defined as follows.

$$\begin{array}{@{}rcl@{}} D_{kl-symmetry}(P,Q)&=&\frac{1}{2} \cdot D_{kl}(P,Q) + \frac{1}{2} \cdot D_{kl}(Q,P) \\ &=&\frac{1}{2} \cdot \sum\limits_{\phi \in {\Phi}}\left( \ln\left(\frac{P(\phi)}{Q(\phi)}\right) \cdot P(\phi) \right) + \frac{1}{2} \cdot \sum\limits_{\phi \in {\Phi}}\left( \ln\left(\frac{Q(\phi)}{P(\phi)}\right) \cdot Q(\phi) \right) \\ &=&\frac{1}{2} \cdot \sum\limits_{\phi \in {\Phi}}\left( \left(\ln P(\phi) - \ln Q(\phi)\right) \cdot P(\phi) + \left(\ln Q(\phi) - \ln P(\phi)\right) \cdot Q(\phi) \right) \\ &=&\frac{1}{2} \cdot \sum\limits_{\phi \in {\Phi}}\left( P(\phi) - Q(\phi)\right) \cdot \ln\left( \frac{P(\phi)}{Q(\phi)}\right). \end{array} $$
(3)

In particular, when using the symmetrised form of the K-L divergence as the splitting criterion of decision trees for a two-class problem, we can ignore the constant factor \( \frac {1}{2} \) and rewrite (3) as follows.

$$ D_{kl-symmetry}(X_{+},X_{-})=\sum\limits_{j = 1}^{P}\left( \left( \frac{\vert X_{+j} \vert}{\vert X_{+} \vert} - \frac{\vert X_{-j} \vert}{\vert X_{-}\vert}\right) \cdot \ln\left( \frac{\frac{\vert X_{+j}\vert}{\vert X_{+}\vert}} {\frac{\vert X_{-j}\vert}{\vert X_{-}\vert}}\right) \right). $$
(4)

Since all symbols used in (4) have the same meaning as in (2), we do not repeat them here; please refer to Section 3.1 for details. Equation (4) essentially reflects the divergence between the feature value distributions of the two classes and has the following properties.

(1) \(D_{kl-symmetry}(X_{+},X_{-})\) is bounded in \([0, +\infty)\).

(2) \(D_{kl-symmetry}(X_{+},X_{-})\) is symmetric and non-negative.

(3) The larger \(D_{kl-symmetry}(X_{+},X_{-})\) is, the better the feature discriminates the two classes.

However, a zero-probability problem exists that makes the upper bound of (4) infinite. Smoothing methods are generally used to avoid this problem; in this paper, we adopt Laplace smoothing, i.e., additive smoothing. For example, assuming \(\vert X_{+j} \vert = 0\) and that there are \(m_{j}\) different values of the feature X, the smoothed conditional probability is defined as follows.

$$ P(X_{j} \vert +)=\frac{\vert X_{+j} \vert + 1}{\vert X_{+} \vert + \lambda \cdot m_{j}}, $$
(5)

where λ is the smoothing parameter. Here we set λ = 1, which corresponds to Laplace’s law of succession [36]. In this way, (5) can be used to tackle the zero-probability problem.
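The following sketch combines (4) with the smoothing of (5); for brevity it smooths every value rather than only those with zero counts, which is a slight simplification of the procedure described above (the example counts are hypothetical):

```python
import math

def smoothed_probs(counts, lam=1):
    """Eq. (5): Laplace-smoothed conditional probabilities P(X_j | class)."""
    n, m = sum(counts), len(counts)      # m = number of distinct feature values
    return [(c + 1) / (n + lam * m) for c in counts]

def kl_symmetric(counts_pos, counts_neg):
    """Eq. (4): symmetrised K-L divergence of a discrete feature, smoothed."""
    P = smoothed_probs(counts_pos)
    Q = smoothed_probs(counts_neg)
    return sum((p - q) * math.log(p / q) for p, q in zip(P, Q))

# Hypothetical feature with a zero count in the minority class:
# smoothing keeps the divergence finite (~2.3 here).
print(kl_symmetric([9, 0, 1], [10, 45, 45]))
```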

4 Hellinger distance and K-L divergence based lazy decision trees

Although LazyDT avoids unnecessary data fragmentation and duplicate subtrees and may produce a shorter and more accurate decision path for the specific instance, it is still difficult for LazyDT to deal with imbalanced problems. Two main defects impede LazyDT’s ability to learn from imbalanced data. One is that the information gain used by LazyDT is skew-sensitive; the other is that LazyDT does not consider the value of the test instance on a candidate feature when selecting the split feature, which may leave no instances in the new node after the training set is filtered by the value of the test instance on the split feature. In this case, LazyDT ceases to split and returns the most frequent class in the training set covered by the current node, which degrades its classification performance. To overcome these two defects, we use the value of the test instance on the candidate feature to modify the Hellinger distance and the K-L divergence, respectively.

4.1 Hellinger distance based lazy decision trees (HLazyDT)

For a feature X, let v represent the value of the given test instance on X. For a two-class problem, we calculate the Hellinger distance of the value v on the feature X as follows.

$$ W_{H-v}(X_{+},X_{-})=\left| \sqrt{\frac{\vert X_{+v} \vert}{\vert X_{+} \vert}} - \sqrt{\frac{\vert X_{-v} \vert}{\vert X_{-}\vert}} \right| \quad , $$
(6)

where \(\vert X_{+} \vert\) indicates the number of examples belonging to the minority class in the training set and \(\vert X_{+v} \vert\) is the number of minority class examples with the value v for the feature X; \(\vert X_{-} \vert\) and \(\vert X_{-v} \vert\) are defined analogously for the majority class. The larger \(W_{H-v}\) is, the better the value v of the feature X discriminates the two classes; it reflects the contribution of the value v to the discriminative power of the feature X.

As mentioned above, we combine (2) with (6) and obtain the Hellinger distance based split criterion used in HLazyDT as follows.

$$ Split_{H}=\sqrt{W_{H-v}(X_{+},X_{-}) \cdot D_{H}(X_{+},X_{-})} \quad , $$
(7)

where \(D_{H}(X_{+},X_{-})\) and \(W_{H-v}(X_{+},X_{-})\) are calculated according to (2) and (6), respectively. In particular, when the value v of the test instance on the feature X does not appear among the values of the feature X in the training set, both (6) and (7) equal zero. In effect, (7) measures the ability of the feature X to discriminate the two classes based not only on all values of the feature X but also on the value of the test instance on that feature.
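A minimal sketch of (7), assuming per-value counts of a discrete feature and the index v of the test instance's value (the example counts are hypothetical):

```python
import math

def split_h(counts_pos, counts_neg, v):
    """Eq. (7): Hellinger distance based split score used by HLazyDT.

    counts_pos / counts_neg: per-value counts of the feature for the minority
    (+) and majority (-) class; v: index of the test instance's value.
    """
    n_pos, n_neg = sum(counts_pos), sum(counts_neg)
    d_h = math.sqrt(sum(                                  # Eq. (2)
        (math.sqrt(p / n_pos) - math.sqrt(q / n_neg)) ** 2
        for p, q in zip(counts_pos, counts_neg)))
    w_hv = abs(math.sqrt(counts_pos[v] / n_pos)           # Eq. (6)
               - math.sqrt(counts_neg[v] / n_neg))
    return math.sqrt(w_hv * d_h)

# Hypothetical feature with three values; the test instance takes the first.
print(split_h([8, 1, 1], [10, 45, 45], v=0))   # ~0.67
```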

Algorithm 1 outlines the approach for calculating the Hellinger distance and the procedure for inducing HLazyDT. In this algorithm, \(T_{y=i}\) indicates the subset of the training set T with class i, \(T_{f=j}\) specifies the subset with the value j for the feature f, and \(T_{f=j,y=i}\) identifies the subset with class i and value j for the feature f. For a two-class problem, the class i is drawn from a finite set of classes, namely the minority class (+) and the majority class (−).

Algorithm 1 Hellinger distance based lazy decision tree (HLazyDT)
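Since the published pseudocode is given as Algorithm 1, the following is only a compact, hypothetical rendering of the induction loop described above: it repeatedly scores the unused features with the criterion of (7), splits on the best one, and stops when the node is pure or no feature obtains a positive score. Tie-breaking and other details of the original algorithm are omitted, and every training instance is assumed to carry a value for every feature of the test instance.

```python
import math
from collections import Counter

def split_h(counts_pos, counts_neg, v):
    """Eq. (7), restated on dictionaries of value -> count so the sketch is self-contained."""
    n_pos, n_neg = sum(counts_pos.values()), sum(counts_neg.values())
    values = set(counts_pos) | set(counts_neg)
    d_h = math.sqrt(sum(
        (math.sqrt(counts_pos.get(j, 0) / n_pos)
         - math.sqrt(counts_neg.get(j, 0) / n_neg)) ** 2
        for j in values))                                  # Eq. (2)
    w_hv = abs(math.sqrt(counts_pos.get(v, 0) / n_pos)
               - math.sqrt(counts_neg.get(v, 0) / n_neg))  # Eq. (6)
    return math.sqrt(w_hv * d_h)

def hlazydt_classify(train, x, minority="+", majority="-"):
    """Trace one HLazyDT decision path for the test instance x.

    train: list of (features: dict, label) pairs with labels "+" or "-";
    x: dict mapping feature names to discrete values.
    """
    data = list(train)
    used = set()
    while True:
        labels = [y for _, y in data]
        if len(set(labels)) == 1:                  # pure node: classify
            return labels[0]
        best_f, best_score = None, 0.0
        for f, v in x.items():
            if f in used:
                continue
            cp = Counter(feats[f] for feats, y in data if y == minority)
            cn = Counter(feats[f] for feats, y in data if y == majority)
            score = split_h(cp, cn, v)
            if score > best_score:
                best_f, best_score = f, score
        if best_f is None:                         # no discriminative feature left
            return Counter(labels).most_common(1)[0][0]
        used.add(best_f)
        data = [(feats, y) for feats, y in data if feats[best_f] == x[best_f]]
```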

4.2 K-L divergence based lazy decision trees (KLLazyDT)

Similarly, we first calculate the K-L divergence of the value v as follows.

$$ W_{KL-v}(X_{+},X_{-})=\left( \frac{\vert X_{+v} \vert}{\vert X_{+} \vert} - \frac{\vert X_{-v} \vert}{\vert X_{-}\vert}\right) \cdot \ln\left( \frac{\frac{\vert X_{+v}\vert}{\vert X_{+}\vert}} {\frac{\vert X_{-v}\vert}{\vert X_{-}\vert}}\right) , $$
(8)

where all symbols have the same meaning as in (6); please refer to Section 4.1 for details. It reflects the discriminative power of the value v of the test instance on the feature X. Next, the K-L divergence based split criterion used in KLLazyDT is given as follows.

$$ Split_{KL}=\sqrt{W_{KL-v}(X_{+},X_{-}) \cdot D_{kl-symmetry}(X_{+},X_{-})} \quad , $$
(9)

where \(D_{kl-symmetry}(X_{+},X_{-})\) and \(W_{KL-v}(X_{+},X_{-})\) are calculated according to (4) and (8), respectively. Likewise, if the value v of the test instance on the feature X does not appear among the values of the feature X in the training set, both (8) and (9) equal zero. Moreover, it is worth mentioning that (5) is still needed to smooth (4) and (8) when zero-probability problems arise. The framework of the K-L divergence based lazy decision tree is described in Algorithm 2; please refer to Algorithm 1 for the explanations of the related symbols.
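A corresponding sketch of (9), applying the smoothing of (5) throughout for simplicity (the example counts are hypothetical):

```python
import math

def split_kl(counts_pos, counts_neg, v, lam=1):
    """Eq. (9): K-L divergence based split score used by KLLazyDT."""
    m = len(counts_pos)                               # number of feature values
    P = [(c + 1) / (sum(counts_pos) + lam * m) for c in counts_pos]   # Eq. (5)
    Q = [(c + 1) / (sum(counts_neg) + lam * m) for c in counts_neg]
    d_kl = sum((p - q) * math.log(p / q) for p, q in zip(P, Q))       # Eq. (4)
    w_klv = (P[v] - Q[v]) * math.log(P[v] / Q[v])                     # Eq. (8)
    return math.sqrt(w_klv * d_kl)

# Same hypothetical counts as in the HLazyDT example.
print(split_kl([8, 1, 1], [10, 45, 45], v=0))   # ~1.37
```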

Finally, we consider the time and memory complexity of eager and lazy decision trees, which has been discussed in [33]; to keep this paper self-contained, we restate the relevant results. For convenience, we consider binary splitting, a training dataset with n instances and f attributes, and a test dataset with t instances and f attributes, and we assume that a node is expanded only if it contains at least p training instances. (1) The time complexity of an eager decision tree model is \(O({\sum }_{i = 0}^{\lg \frac {n}{p}} 2^{i} \cdot f \cdot \frac {n}{2^{i}})= O(n \cdot f \cdot \lg \frac {n}{p})\) and the time complexity of a lazy decision tree model is \(O(t \cdot {\sum }_{i = 0}^{\lg \frac {n}{p}} f \cdot \frac {n}{2^{i}})= O(t \cdot f \cdot n) \) on average. Note that \( \lg \frac {n}{p} \) is the average depth of exploration, \(2^{i}\) is the average number of nodes at depth i, and \( \frac {n}{2^{i}} \) is the average number of training instances contained in a node at depth i. Thus, unless the number of unlabeled observations is small, lazy decision trees can be slower in prediction mode than the equivalent eager decision trees. (2) Building an eager decision tree requires memory for the stack operations used during the recursive construction of the tree as well as storage for the final trained model, while lazy decision trees only require the stack. The stack size and the model size of a decision tree are \(O({\sum }_{i = 0}^{\lg \frac {n}{p}} \frac {n}{2^{i}}) = O(n)\) and \( O(\frac {n}{p})\) (the number of non-leaf nodes in a decision tree) on average, respectively. Hence, an eager decision tree requires \( O(n) + O(\frac {n}{p}) \) memory on average in the training phase and \( O(\frac {n}{p}) \) in the testing phase, while a lazy decision tree requires only O(n). This means that an eager decision tree requires more memory in the training phase than a lazy decision tree, but less memory in the testing phase.
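As an illustrative calculation with hypothetical values \(n = 10^{4}\), \(p = 10\), \(f = 20\) and \(t = 10^{3}\):

$$ n \cdot f \cdot \lg\frac{n}{p} \approx 10^{4} \cdot 20 \cdot 10 = 2 \times 10^{6}, \qquad t \cdot f \cdot n = 10^{3} \cdot 20 \cdot 10^{4} = 2 \times 10^{8}, $$

so, for this configuration, the lazy model performs roughly two orders of magnitude more work in the prediction phase than the eager model spends building its tree.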

Algorithm 2 K-L divergence based lazy decision tree (KLLazyDT)

5 Experiments

The whole experiment consists of the following steps. Firstly, we categorize the experimental data sets into three groups: lowly (IR < 9) imbalanced data sets, highly (IR ≥ 9) imbalanced data sets and highly imbalanced data sets with SMOTE preprocessing. Secondly, we use equal-frequency binning to discretize the datasets used in the experiment. Finally, we use five-fold cross-validation to train and test the participating methods and then analyze the experimental results. The overall framework of the experiment is shown in Fig. 2, and the methods’ abbreviations used in Fig. 2 are listed in Table 1.

Fig. 2 The framework of the whole experiment

Table 1 All methods used in the experiment, their parameter settings and abbreviations

5.1 Experiment tool

In this section, we present the setup of the experimental framework used to develop the analysis of our proposals. We use KEEL (Knowledge Extraction based on Evolutionary Learning) to carry out the whole experiment; it empowers the user to assess the behavior of evolutionary learning and soft computing based techniques for all kinds of data mining problems: regression, classification, clustering, pattern mining and so on. KEEL is an open source Java framework (GPLv3 license) that provides a number of modules to perform a wide variety of data mining tasks. It includes tools for data management, adding new algorithms, designing multiple kinds of experiments and statistical analysis, and provides a simple GUI to design experiments with different data sets and computational intelligence algorithms in order to assess their behaviour. The latest version is KEEL 3.0, which includes new modules for semi-supervised learning, multi-instance learning, imbalanced classification and subgroup discovery [37]. In our experiment, we first implement the Hellinger distance based lazy decision tree and the K-L divergence based lazy decision tree in KEEL 3.0 and then use the imbalanced-classification module and the datasets provided by KEEL 3.0 to carry out the whole experiment.

5.2 Methods

In this experiment, we compare the Hellinger distance based lazy decision tree (HLazyDT) and the K-L divergence based lazy decision tree (KLLazyDT) with other methods, including the lazy decision tree (LazyDT), C4.5, the Hellinger distance based decision tree (HDDT) and the support vector machine (SVM), across lowly and highly imbalanced data sets. Among them, C4.5 is a popular eager decision tree algorithm that was used in the original LazyDT experiments [1]. Since imbalanced classification is the focus of this experiment, we use the unpruned version of C4.5, which has been shown to be an effective strategy [38]. Similarly, HDDT is an eager decision tree designed for imbalanced classification [24, 25]. Additionally, SVM is a popular approach for data classification and has been successfully applied in various applications [39]. Since SVM does not belong to the family of decision trees, we only use it as a point of comparison, and the statistical analysis of the experimental results does not include it. Finally, all methods used in this experiment, their parameter settings and their abbreviations are shown in Table 1; these abbreviations are used throughout the rest of this paper.

5.3 Datasets and data partitions

In this experiment, we select data sets from the KEEL data set repository and categorize them into two groups, lowly (IR < 9) and highly (IR ≥ 9) imbalanced data sets, according to the imbalance ratio (IR), i.e., the ratio of the number of majority class examples to the number of minority class examples. Although there is no consensus in the literature on when a data set is considered highly imbalanced, in this paper we regard an IR above 9 as highly imbalanced, because a classifier that ignores the minority class examples still attains an accuracy error of no more than 0.1. The characteristics of the data sets are summarized in Tables 2 and 3, where the number of examples, the number of minority class examples, the number of features, the number of continuous features, the number of discrete features and the imbalance ratio are denoted by “Examples”, “Minority”, “Features”, “Continuous”, “Discrete” and “IR” respectively. All of these data sets are two-class problems.

Table 2 Lowly imbalanced data sets used in the experiment
Table 3 Highly imbalanced data sets used in the experiment

On the other hand, when evaluating each of these approaches, we first use equal-frequency binning to discretize all data sets, since the family of lazy decision trees only deals with discrete data. Secondly, in order to explore the effect of SMOTE [10] on the methods used in this experiment, we also use SMOTE to preprocess the highly imbalanced data sets and compare the performance of the methods. Since our goal is to use SMOTE to obtain balanced data sets, the SMOTE level (SL) for the minority class is calculated as follows.

$$ SL=(\frac{n - n_{+}}{n_{+}} -1) \times 100\%. $$
(10)
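A one-line sketch of (10); the counts are hypothetical and serve only to check that the resulting SMOTE level balances the two classes:

```python
def smote_level(n, n_pos):
    """Eq. (10): oversampling level (in %) that balances the two classes."""
    return ((n - n_pos) / n_pos - 1) * 100

# Hypothetical data set with 1000 instances, 100 of them in the minority class.
sl = smote_level(1000, 100)     # 800%
# After SMOTE the minority class grows to n_pos * (1 + sl/100) = 900 instances,
# which equals the majority class size n - n_pos.
print(sl)
```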

Finally, we perform five-fold cross-validation. In this procedure, each data set is randomly partitioned into five equal-sized, disjoint subsets; a single subset is retained as the validation data for testing the model, and the remaining four subsets are used as the training data. The process is repeated five times, and the mean over the five repetitions is taken as the final result.

5.4 Evaluation measure

Since different evaluation measures can lead to different conclusions, it is advisable to use multiple evaluation measures, especially when learning from imbalanced datasets [40]. Therefore, we employ the area under the receiver operating characteristic (ROC) curve and the G-mean, two popular evaluation measures for imbalanced problems, to compare the performance of all methods used in this experiment; a small computational sketch of both measures follows their definitions below.

(1) The area under the ROC curve (AUC) can be calculated as follows [41]:

$$ AUC=\frac{2S_{0}-n_{+}(n_{+}+ 1)}{2n_{+}n_{-}}, $$
(11)

where \(n_{+}\) and \(n_{-}\) denote the numbers of minority class and majority class instances respectively, and \(S_{0}\) is the sum of the ranks of the minority class instances.

(2) The G-mean is computed as the geometric mean of the true positive and true negative rates, i.e.,

$$ G\text{-}mean=\sqrt{\frac{TP}{TP+FN} \cdot \frac{TN}{TN+FP}}, $$
(12)

where TP denotes the number of true positive instances, TN the number of true negative instances, FP the number of false positive instances and FN the number of false negative instances.
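As referenced above, the following sketch evaluates both measures; the rank-based AUC of (11) is computed with average ranks for ties (the scores and confusion-matrix counts are hypothetical, and scipy is assumed to be available):

```python
import math
from scipy.stats import rankdata

def auc_rank(scores_pos, scores_neg):
    """Eq. (11): AUC from the rank sum of the minority class scores."""
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    ranks = rankdata(list(scores_pos) + list(scores_neg))  # average ranks for ties
    s0 = ranks[:n_pos].sum()
    return (2 * s0 - n_pos * (n_pos + 1)) / (2 * n_pos * n_neg)

def g_mean(tp, fn, tn, fp):
    """Eq. (12): geometric mean of the true positive and true negative rates."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

print(auc_rank([0.9, 0.6, 0.8], [0.7, 0.2, 0.4, 0.1]))  # ~0.917
print(g_mean(tp=8, fn=2, tn=85, fp=5))                  # ~0.87
```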

5.5 Statistical tests for performance comparison

To analyze the experimental results, non-parametric tests are used for the statistical comparison of all methods, because the initial conditions that guarantee the reliability of parametric tests may not be satisfied [42, 43]. In this experiment, we use the adjusted p-value Holm procedure to find which methods differ significantly in a 1 × n comparison. In this procedure, we first obtain the average ranks of the methods according to the Friedman procedure. In the case of a tie, the average rank is assigned to each tied method; for instance, if two methods tie for first, each is assigned a rank of 1.5, and if three tie for first, each is assigned a rank of 2, and so on. Secondly, we need to choose a base (control) method. In our experiment, we choose the method with the minimum average rank as the base method and arrange the others in ascending order of their average ranks. The test statistic for comparing the ith method with the base method is as follows.

$$ Z=\frac{R_{i} - R_{b}}{\sqrt{\frac{K(K + 1)}{6N}}}, $$
(13)

where \(R_{i}\) and \(R_{b}\) are the average ranks computed through the Friedman test for the ith method and the base method respectively, K is the number of methods to be compared and N is the number of data sets used in the comparison. The Z value is used to find the corresponding probability (P-value) from the normal distribution table, which is then compared with an appropriate significance level α. The adjusted P-value Holm procedure starts with the most significant P-value; the ith adjusted P-value (\(APV_{i}\)) is defined as follows.

$$ APV_{i}=min\{v;1\}, where~v=max\{(k-j)P_{j},1\leq j \leq i\}. $$
(14)

If \(APV_{i} < \alpha\), the null hypothesis, namely that there is no significant difference between the ranks of the two methods, is rejected. As soon as a null hypothesis cannot be rejected, all remaining hypotheses are retained as well.
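A compact sketch of (13) and (14) on hypothetical average ranks; the non-control methods are processed in ascending order of their P-values, i.e., starting with the most significant one, as in Holm's step-down procedure:

```python
import math

def holm_apv(avg_ranks, base, n_datasets):
    """Eqs. (13)-(14): Holm adjusted P-values for a 1 x n comparison with a control."""
    k = len(avg_ranks)                                   # number of compared methods
    se = math.sqrt(k * (k + 1) / (6 * n_datasets))
    # Z statistic (Eq. (13)) and two-sided normal P-value for each non-control method.
    p_values = {m: math.erfc(abs((r - avg_ranks[base]) / se) / math.sqrt(2))
                for m, r in avg_ranks.items() if m != base}
    ordered = sorted(p_values, key=p_values.get)         # most significant first
    apvs, running_max = {}, 0.0
    for j, m in enumerate(ordered, start=1):
        running_max = max(running_max, (k - j) * p_values[m])   # Eq. (14)
        apvs[m] = min(running_max, 1.0)
    return apvs

# Hypothetical average Friedman ranks of five decision trees over 20 data sets.
ranks = {"HLazyDT": 2.1, "KLLazyDT": 2.3, "HDDT": 2.9, "C4.5": 3.6, "LazyDT": 4.1}
print(holm_apv(ranks, base="HLazyDT", n_datasets=20))
```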

5.6 Experiment results and analysis

In this section, we present the empirical analysis of our methods, HLazyDT and KLLazyDT, in order to determine their robustness in three different scenarios: lowly imbalanced data sets, highly imbalanced data sets and highly imbalanced data sets with SMOTE preprocessing. We first compare the average AUC and G-mean values of the different approaches, and then compare their average ranks at the 95% confidence level according to AUC and G-mean respectively. If there is a statistically significant difference between the base approach and another method, the corresponding method is marked in Table 10.

The experimental results are shown in Tables 4, 5, 6, 7, 8 and 9. Tables 4, 6 and 8 show the AUC results for each method across the three types of datasets used in this experiment, and the remaining tables show the G-mean results. To compare the average AUC and G-mean results clearly, we also provide the charts shown in Figs. 3 and 4 respectively. The observations can be summarized as follows.

(1) Across the lowly imbalanced data sets, HLazyDT and KLLazyDT outperform the other trees (excluding SVM) according to AUC, which can also be seen in Fig. 3. As shown in Table 10, HLazyDT and KLLazyDT perform statistically significantly better than LazyDT at the 95% confidence level according to the average ranks of AUC. Similarly, focusing on the G-mean results, we obtain the same conclusion, which can be seen from Table 5 and Fig. 4.

(2) Across the highly imbalanced data sets, Tables 6 and 7 show that HLazyDT and KLLazyDT outperform the other trees (excluding SVM) according to both AUC and G-mean, which can also be seen from Figs. 3 and 4. Moreover, Table 10 shows that HLazyDT and KLLazyDT perform statistically significantly better than the other trees at the 95% confidence level according to the average ranks of both AUC and G-mean.

(3) When SMOTE is used to preprocess the highly imbalanced data sets, the performance of all methods improves. As shown in Table 10, HDDT performs statistically significantly better than C4.5, LazyDT and KLLazyDT at the 95% confidence level according to the average ranks of both AUC and G-mean. In addition, Tables 8 and 9 show that HLazyDT and KLLazyDT outperform C4.5 and LazyDT, though not SVM and HDDT. Nevertheless, HLazyDT and KLLazyDT still achieve performance comparable to HDDT (neither significantly worse) according to the average AUC and G-mean values, which can be clearly observed from Figs. 3 and 4.

Table 4 The AUC results across lowly imbalanced data sets
Table 5 The G-means results across lowly imbalanced data sets
Table 6 The AUC results across highly imbalanced data sets
Table 7 The G-means results across highly imbalanced data sets
Table 8 The AUC results across highly imbalanced data sets with SMOTE
Table 9 The G-means results across highly imbalanced data sets with SMOTE
Table 10 Average Friedman rankings and APVs of various decision trees using Holm’s procedure in ROC and G-means
Fig. 3 Comparison of the average ROC results for each method across all data sets

Fig. 4 Comparison of the average G-means results for each method across all data sets

Finally, we provide the following analyses and conclusions.

(1) It is obvious from Tables 4, 5, 6, 7, 8 and 9 that the AUC and G-mean results are not determined by the class imbalance ratio alone. As it turns out, data set complexity, such as class overlapping, lack of representative data and small disjuncts, is the primary determining factor of classification deterioration [7, 8].

(2) SMOTE preprocessing is very helpful for improving the performance of all methods used in this experiment when learning from highly imbalanced data sets. The main reason is that SMOTE creates artificial data based on feature-space similarities between existing minority class instances, which broadens the decision region of a classifier and improves generalization [10].

(3) The information gain ratio used by C4.5 prefers features with fewer values and is skew-sensitive, so C4.5 cannot always select the best split feature. LazyDT uses the information gain to choose the split feature, which is also skew-sensitive. Moreover, LazyDT does not consider the value of the test instance on the candidate feature, and this strategy may cause it to backtrack to the parent node and return the most frequent class, which degrades the classification performance to some extent. In contrast, HDDT, HLazyDT and KLLazyDT all use skew-insensitive split criteria and can produce more branches to capture finer-grained differences in the datasets; therefore, they distinguish between the minority and majority classes better than LazyDT and C4.5. We also note that HLazyDT and KLLazyDT significantly outperform HDDT across the highly imbalanced datasets. Since the minority class instances in the highly imbalanced datasets are very rare, the values of minority class test instances on some features may not appear in the training set, which may cause C4.5, LazyDT and HDDT to backtrack to the parent node and return the most frequent class, again degrading the classification performance. Furthermore, the minority class instances in the highly imbalanced datasets are abundant in small disjuncts. Eager decision trees such as C4.5 and HDDT tend to miss small disjuncts because they ignore instances that occur infrequently, such as instances in very small disjuncts, while lazy decision trees do not ignore any instances. This is why the performance of LazyDT is superior to that of C4.5 according to both AUC and G-mean. Consequently, lazy learning methods are recommended when small disjuncts abound [44].

(4) In our experiment, the performance of SVM is superior to that of all the decision trees according to both evaluation measures, AUC and G-mean. This result is consistent with the conclusions in [45, 46] and indicates that SVM achieves better generalization than decision trees on nonlinear problems, while decision trees can easily suffer from overfitting.

6 Conclusion

In this paper, we use the Hellinger distance and the K-L divergence as split criteria to modify the lazy decision tree, yielding HLazyDT and KLLazyDT, in order to deal with imbalanced classification. An experimental framework using five-fold cross-validation is applied across a wide range of imbalanced data sets to investigate the effectiveness of our methods compared with the lazy decision tree, C4.5, the Hellinger distance based decision tree and the support vector machine, according to the AUC and G-mean evaluation measures. The adjusted p-value Holm post-hoc procedure of the Friedman test is used to determine the significance of the ranks across multiple data sets. Based on the experimental results, we demonstrate that using the Hellinger distance or the K-L divergence as the split criterion improves the performance of LazyDT on imbalanced classification, and that HLazyDT and KLLazyDT outperform the other decision trees on the imbalanced datasets used in our experiment. In particular, when SMOTE is used to preprocess the highly imbalanced datasets, HLazyDT and KLLazyDT still achieve performance comparable to HDDT. On the other hand, the performance of SVM is superior to that of all the decision trees, which suggests that SVM achieves better generalization, while decision trees can easily suffer from overfitting. Finally, we recommend HLazyDT and KLLazyDT for imbalanced learning when the choice of approach is limited to decision trees, and we consider them a good choice for imbalanced data stream learning in areas such as medical data mining and privacy preservation. In future work, we wish to extend our methods to deal with continuous features and multi-class imbalanced problems. Additionally, ensemble methods have been proven to help decision trees solve imbalanced learning effectively [47], and a new ensemble strategy based on optimizing a decision directed acyclic graph has been proposed for multi-class classification problems [48], which inspires us to use ensemble methods to further improve the ability of HLazyDT and KLLazyDT to tackle imbalanced learning.