1 Introduction

Decision trees are among the most popular classification methods due to their efficiency, simplicity and interpretability. Algorithms for constructing decision trees such as C4.5 [1] create a single decision tree during the training phase and then use the tree to classify test instances. In contrast, the lazy decision tree (LazyDT) builds a customized tree for each test instance, which consists of only a single path from the root to a leaf node [2]. LazyDT has two strengths compared with eager decision trees. Firstly, the decision paths built by LazyDT are often shorter than the paths of eager decision trees. Secondly, LazyDT can overcome the data fragmentation problem [3]. However, LazyDT does not take imbalanced data into account, i.e., data in which one class (the majority class) vastly outnumbers the other (the minority class). Due to the nature of learning algorithms, learning from imbalanced data sets remains a challenging problem that arises widely in many domains including, but not limited to, fraud detection [4], network intrusion [5] and medical diagnosis [6]. In order to overcome the class imbalance problem, many methods have been proposed; they can be roughly divided into three categories as follows [7,8,9]. (1) Data-level methods concentrate on modifying the training set until all classes are approximately equally represented. The most common techniques are undersampling of the majority class and oversampling of the minority classes. Since undersampling of the majority class does not take full advantage of the useful information in the majority class, oversampling has received more attention. A very popular approach is SMOTE (Synthetic Minority Oversampling TEchnique), which increases diversity by generating synthetic minority class data [10]. Depending on how the synthetic samples are generated, several variants of SMOTE have been proposed, such as Borderline-SMOTE [11], the Adaptive Synthetic Sampling Technique (ADASYN) [12], MSMOTE [13] and MWMOTE [14]. In addition, when a dataset is high-dimensional and highly class-imbalanced, existing online feature selection algorithms usually ignore the small classes, which can be important in these applications; selecting features from such data in an online manner is therefore a challenge. Online Feature Selection based on the Dependency in K nearest neighbors (K-OFSD) uses the information of nearest neighbors to select relevant features [15]. (2) Algorithm-level methods directly modify existing algorithms to alleviate the bias towards the majority class. The most commonly modified methods are decision trees and support vector machines (SVMs). For decision trees, the main approach is to use skew-insensitive split criteria to generate the tree, which is discussed in detail in the next paragraph. For SVMs, one approach is to adjust the class boundary towards the majority class based on the kernel-alignment idea [16]; the other is to modify the twin support vector machine (TSVM), as in the maximum margin of twin spheres support vector machine (MMTSSVM) and the maximum margin of twin spheres support vector machine with pinball loss (Pin-MMTSSVM). MMTSSVM finds two homocentric spheres by solving a quadratic programming problem and a linear programming problem: the small sphere contains as many positive samples as possible, while most negative samples are pushed outside the large sphere [17].
In contrast, Pin-MMTSSVM replaces the hinge loss used in MMTSSVM with the pinball loss to improve its generalization performance [18]. (3) Hybrid methods combine the advantages of the two previous groups. One strategy is to resample the dataset and then use standard classifiers such as decision trees, Naive Bayes and SVMs; the other is to combine pre-processing with bagging and boosting. SMOTEBagging [19], SMOTEBoost [20], EasyEnsemble [21] and BalanceCascade [21] are examples of this type of approach. Compared with the other two groups, hybrid methods offer more advantages and broader room for development [22]. Moreover, multi-class imbalance classification is widely applied in many areas. In recent years, a new ensemble learning method has been proposed to tackle multi-class imbalance classification; it uses the OVO (one-vs-one) decomposition scheme to divide an m-class imbalance classification problem into m(m − 1)/2 binary subproblems and utilizes a confidence degree matrix to perform the aggregation [23].

In terms of decision trees, the split criteria commonly used in decision trees, such as the Gini index, information gain and DKM, have been shown to be skew-sensitive, and the Hellinger distance has been proposed as a split criterion to build Hellinger distance decision trees (HDDT) [24, 25]. Meanwhile, variants of HDDT have been proposed to deal with different problems. For example, the Multi-Class HDDT (MHDDT) employs techniques similar to Error Correcting Output Codes (ECOC) decomposition to deal with multi-class imbalance classification [26], and the Gaussian Hellinger Very Fast Decision Tree (GH-VFDT) computes the Hellinger distance between two normal distributions P and N directly, which allows it to deal with imbalanced data streams [27]. LazyDT, in contrast, adopts information gain to build the decision path, which impedes its ability to learn the minority class concept. For these reasons, we use the Hellinger distance and the K-L divergence as split criteria to build LazyDT, yielding the Hellinger distance based lazy decision tree (HLazyDT) and the K-L divergence based lazy decision tree (KLLazyDT), respectively. An experimental framework is applied across a wide range of imbalanced data sets to investigate the effectiveness of our methods compared with other methods, including the lazy decision tree, C4.5, the Hellinger distance based decision tree and the support vector machine. The data sets used in this experiment are categorized into two groups, highly and lowly imbalanced data sets, according to their imbalance ratio. In addition, we use SMOTE [10] to preprocess the highly imbalanced data sets and evaluate its effectiveness, since it has become the de facto standard for improving the performance of decision tree algorithms [28]. The experimental results, which are contrasted through nonparametric statistical tests, demonstrate that using the Hellinger distance or the K-L divergence as the split criterion effectively improves the performance of LazyDT on imbalanced classification.

In summary, the key contributions of this paper are as follows. (1) We analyse why the split criterion used in LazyDT is skew-sensitive. (2) We use the Hellinger distance and the K-L divergence as split criteria in LazyDT and demonstrate that this modification effectively improves the performance of LazyDT on imbalanced classification. (3) We investigate the effectiveness of SMOTE preprocessing.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 provides the formulations and some related properties of the Hellinger distance and the K-L divergence. Section 4 describes the Hellinger distance based lazy decision tree (HLazyDT) and the K-L divergence based lazy decision tree (KLLazyDT). Section 5 describes the experimental framework and analyzes the experimental results. Finally, the conclusions of this work are presented in Section 6.

2 Related work

As with many lazy algorithms, there is essentially no training phase and all the work of LazyDT is done during the classification of a given test instance. LazyDT, which receives the test instance as part of its input, follows a separate-and-classify methodology: a feature is chosen and the sub-problem containing the instances with the same feature value as the given instance is then solved recursively. The overall effect is that of tracing a path in an imaginary tree built for the test instance. The heart of LazyDT is the information gain used as the split criterion to select the split feature. Unlike eager decision trees, LazyDT takes into account the information of not only the training instances but also the test instance. Essentially, LazyDT builds a single decision path for a given test instance, which avoids splits on features that are irrelevant for the specific instance. Therefore, LazyDT avoids the unnecessary data fragmentation and duplicate subtrees shown in Fig. 1 and may produce a shorter and more accurate decision path for the specific instance. On the other hand, three problems need to be solved in the framework of LazyDT. The first is how to handle missing values: LazyDT simply ignores the missing feature and never branches on a missing value in the test instance. The second problem is that the information gain may be negative or zero; in this case, LazyDT returns the most frequent class in the training set covered by the current node. The last problem is that there may be no training instances with the same feature value as the given instance; in this case, LazyDT backtracks to the parent node and returns the most frequent class in the training set. Finally, from an application point of view, LazyDT has also been applied to distributed privacy-preserving data mining in recent years [29].
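To make the separate-and-classify procedure concrete, the following minimal Python sketch traces a single decision path for one test instance using an entropy-based gain. It is only an illustration of the description above, not the exact published algorithm [2]: the dictionary-based instance representation is our own choice, and the missing-value handling discussed in the text is omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def lazy_classify(train, test_instance):
    """Trace one LazyDT decision path for a single test instance.

    train: list of (features: dict, label) pairs; test_instance: dict of
    feature name -> discrete value.  A minimal sketch only.
    """
    data = list(train)
    used = set()
    while True:
        labels = [y for _, y in data]
        if len(set(labels)) == 1:                    # pure node: classify
            return labels[0]
        parent_h = entropy(labels)
        best_feat, best_gain = None, 0.0
        for f, v in test_instance.items():
            if f in used:
                continue
            branch = [y for feats, y in data if feats.get(f) == v]
            if not branch:                           # value unseen in this node
                continue
            gain = parent_h - entropy(branch)        # gain along the single branch
            if gain > best_gain:
                best_feat, best_gain = f, gain
        if best_feat is None:                        # gain <= 0 for every feature
            return Counter(labels).most_common(1)[0][0]
        used.add(best_feat)
        data = [(feats, y) for feats, y in data
                if feats.get(best_feat) == test_instance[best_feat]]
```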

Fig. 1 Data fragmentation and duplicate subtrees in eager decision trees

Since boosting can effectively improve the performance of various decision trees [30, 31], Fern and Brodley proposed relevance-based boosting of lazy decision trees, which consists of two steps [32]. Firstly, it builds a customized LazyDT ensemble for each test instance. All training instances are equally weighted at the beginning. In each iteration, a decision path for the given test instance is produced by applying LazyDT to the weighted training set. The instance weights are then adjusted to form a new distribution for the next iteration according to how relevant each instance is to classifying the given test instance and whether its class label is predicted correctly by the current decision path. The relevance level is the depth of the node at which the training instance is discarded when producing the given decision path. Secondly, relevance-based boosting of lazy decision trees also adopts a distance-based pruning strategy to address the problem of over-fitting. In each iteration, the pruning strategy performs a greedy search to select a feature to delete based on the ratio of the heterogeneous distance to the homogeneous distance. The heterogeneous distance measures the distance between the test instance and the closest cluster of instances from the other classes, while the homogeneous distance is defined as the distance between the test instance and the most frequent class in the current set. Although an empirical comparison with boosted regular decision trees shows that relevance-based boosting of lazy decision trees achieves comparable accuracy and better comprehensibility, the distance-based pruning strategy is biased against the minority class. Thus, it is difficult for this approach to deal with imbalanced data sets.

Because LazyDT must be re-run independently for each test instance, its main drawback is the high computational cost of the classification phase. In order to reduce this cost, a batched lazy decision tree was introduced by Guillame-Bert and Dubrawski [33]. The batched lazy decision tree substitutes a subtree for a single path, so that each necessary node is visited only once in a single pass. The algorithm uses the value of the test instance on the split feature to filter the training set and the test set simultaneously in each iteration. Once the termination condition is satisfied, the filtered test instances are labeled with the most frequent class.

In summary, the algorithms mentioned above do not take imbalanced learning into account; all of them rely on information gain as the split criterion. It is well known that information gain is a measure of impurity. For imbalanced data sets, the class distribution is skewed, which means that the impurity of imbalanced data is lower than that of balanced data. When information gain is used as the split criterion to choose a feature on an imbalanced data set, the gain may be zero or even negative, which halts tree growth. As a result, test instances tend to be classified as the majority class, which degrades the performance of the classifier. This is why information gain is skew-sensitive.
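As a small numeric illustration of this argument (all counts below are hypothetical), the entropy of an imbalanced class distribution is already low, and the gain measured along the single branch matching the test instance's value can indeed be negative:

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a class-count vector."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Impurity of a balanced vs. an imbalanced class distribution.
print(entropy([50, 50]))                 # 1.000 bit
print(entropy([90, 10]))                 # ~0.469 bits: already quite "pure"

# LazyDT-style gain along the single branch that matches the test value:
# the branch may be less pure than its parent, so the gain is negative
# and tree growth stops, defaulting to the majority class.
parent = [90, 10]                        # node with 90 majority, 10 minority
branch = [8, 2]                          # instances sharing the test value
print(entropy(parent) - entropy(branch)) # ~ -0.253
```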

3 Hellinger distance and K-L divergence

3.1 Hellinger distance

The Hellinger distance is a measure that reflects the divergence between two probability distributions [34]. Given a countable space Φ and two distributions P and Q on it, the Hellinger distance is defined as follows.

$$ D_{H}(P,Q)= \sqrt{\sum\limits_{\phi \in {\Phi}}(\sqrt{P(\phi)} - \sqrt{Q(\phi)})^{2}}\quad. $$
(1)

The Hellinger distance has the following properties.

(1) \(D_{H}(P,Q)\) lies in \([0, \sqrt{2}]\).

(2) The Hellinger distance is symmetric and non-negative.

(3) The larger the Hellinger distance is, the better the two probability distributions are discriminated.

The Hellinger distance with respect to a feature for a two-class problem, which was first used as a split criterion in decision trees [24], is given as follows.

$$ D_{H}(X_{+},X_{-})=\sqrt{\sum\limits_{j = 1}^{P}(\sqrt{\frac{\vert X_{+j} \vert}{\vert X_{+} \vert}} - \sqrt{\frac{\vert X_{-j} \vert}{\vert X_{-}\vert}})^{2}}\quad, $$
(2)

where \(\vert X_{+} \vert\) indicates the number of instances belonging to the minority class in the training set and \(\vert X_{+j} \vert\) is the number of minority class instances with value j for the feature X; \(\vert X_{-} \vert\) and \(\vert X_{-j} \vert\) are defined analogously for the majority class. In addition, P is the number of different values of the feature X. Since (2) is not influenced by the prior probabilities, it is insensitive to the class distribution.
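A minimal sketch of the computation in (2), assuming the per-value counts of a discrete feature have already been gathered (the example counts are hypothetical):

```python
import math

def hellinger_feature(counts_pos, counts_neg):
    """Eq. (2): Hellinger distance of a discrete feature for a two-class problem.

    counts_pos[j] / counts_neg[j]: number of minority / majority class
    instances taking the j-th value of the feature.
    """
    n_pos, n_neg = sum(counts_pos), sum(counts_neg)
    return math.sqrt(sum(
        (math.sqrt(p / n_pos) - math.sqrt(q / n_neg)) ** 2
        for p, q in zip(counts_pos, counts_neg)))

# Hypothetical feature with three values: the distance depends on how
# differently the two classes spread over the values, not on the class priors.
print(hellinger_feature([8, 1, 1], [10, 45, 45]))   # ~0.77
```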

3.2 K-L divergence

The Kullback-Leibler divergence, generally abbreviated as K-L divergence, is an asymmetric measure that reflects the divergence between two probability distributions [35]. Suppose Φ is a countable space and P and Q are two probability distributions on Φ. \(D_{kl}(P,Q)={\sum }_{\phi \in {\Phi }} \left (\ln (\frac {P(\phi )}{Q(\phi )}) \cdot P(\phi )\right )\) denotes the K-L divergence of Q from P. Similarly, the K-L divergence of P from Q is defined as \(D_{kl}(Q,P)={\sum }_{\phi \in {\Phi }} \left (\ln (\frac {Q(\phi )}{P(\phi )}) \cdot Q(\phi )\right )\). In order to make the K-L divergence symmetric, a symmetrised form of it is defined as follows.

$$\begin{array}{@{}rcl@{}} D_{kl-symmetry}(P,Q)&=&\frac{1}{2} \cdot D_{kl}(P,Q) + \frac{1}{2} \cdot D_{kl}(Q,P) \\ &=&\frac{1}{2} \cdot \sum\limits_{\phi \in {\Phi}}\left( \ln\left(\frac{P(\phi)}{Q(\phi)}\right) \cdot P(\phi) \right) + \frac{1}{2} \cdot \sum\limits_{\phi \in {\Phi}}\left( \ln\left(\frac{Q(\phi)}{P(\phi)}\right) \cdot Q(\phi) \right) \\ &=&\frac{1}{2} \cdot \sum\limits_{\phi \in {\Phi}}\left( \left(\ln P(\phi) - \ln Q(\phi)\right) \cdot P(\phi) + \left(\ln Q(\phi) - \ln P(\phi)\right) \cdot Q(\phi) \right) \\ &=&\frac{1}{2} \cdot \sum\limits_{\phi \in {\Phi}}\left( P(\phi) - Q(\phi)\right) \cdot \ln\left( \frac{P(\phi)}{Q(\phi)}\right). \end{array} $$
(3)

In particular, when using the symmetrised form of the K-L divergence as the splitting criterion of decision trees for a two-class problem, we can ignore the constant factor \( \frac {1}{2} \) and rewrite (3) as follows.

$$ D_{kl-symmetry}(X_{+},X_{-})=\sum\limits_{j = 1}^{P}\left( \left( \frac{\vert X_{+j} \vert}{\vert X_{+} \vert} - \frac{\vert X_{-j} \vert}{\vert X_{-}\vert}\right) \cdot \ln\left( \frac{\frac{\vert X_{+j}\vert}{\vert X_{+}\vert}} {\frac{\vert X_{-j}\vert}{\vert X_{-}\vert}}\right) \right). $$
(4)

Since all symbols used in (4) have the same meaning as in (2), we do not repeat them here; please refer to Section 3.1 for details. Equation (4) essentially reflects the divergence between the feature value distributions of the two classes and has the following properties.

(1) \(D_{kl-symmetry}(X_{+},X_{-})\) is bounded in \([0, +\infty)\).

(2) \(D_{kl-symmetry}(X_{+},X_{-})\) is symmetric and non-negative.

(3) The larger \(D_{kl-symmetry}(X_{+},X_{-})\) is, the better the feature discriminates the two classes.

However, a zero-probability problem exists that makes the upper bound of (4) infinite. Smoothing methods are generally used to avoid this problem; in this paper, we adopt Laplace smoothing, i.e., additive smoothing. For example, assuming \(\vert X_{+j} \vert = 0\) and that there are \(m_{j}\) different values of the feature X, the smoothed conditional probability is defined as follows.

$$ P(X_{j} \vert +)=\frac{\vert X_{+j} \vert + 1}{\vert X_{+} \vert + \lambda \cdot m_{j}}, $$
(5)

where λ is the smoothing parameter. Here we set λ = 1, which corresponds to Laplace’s law of succession [36]. In this way, (5) can be used to tackle the zero-probability problem.
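The following sketch combines (4) with the smoothing of (5); for brevity it smooths every value rather than only those with zero counts, which is a slight simplification of the procedure described above (the example counts are hypothetical):

```python
import math

def smoothed_probs(counts, lam=1):
    """Eq. (5): Laplace-smoothed conditional probabilities P(X_j | class)."""
    n, m = sum(counts), len(counts)      # m = number of distinct feature values
    return [(c + 1) / (n + lam * m) for c in counts]

def kl_symmetric(counts_pos, counts_neg):
    """Eq. (4): symmetrised K-L divergence of a discrete feature, smoothed."""
    P = smoothed_probs(counts_pos)
    Q = smoothed_probs(counts_neg)
    return sum((p - q) * math.log(p / q) for p, q in zip(P, Q))

# Hypothetical feature with a zero count in the minority class:
# smoothing keeps the divergence finite (~2.3 here).
print(kl_symmetric([9, 0, 1], [10, 45, 45]))
```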

4 Hellinger distance and K-L divergence based lazy decision trees

Although LazyDT avoids unnecessary data fragmentation and duplicate subtrees and may produce a shorter and more accurate decision path for the specific instance, it is still difficult for LazyDT to deal with imbalanced problems. Two main defects impede LazyDT’s ability to learn from imbalanced data. One is that the information gain used by LazyDT is skew-sensitive; the other is that LazyDT does not consider the value of the test instance on a candidate feature when selecting the split feature, which may leave no instances in the new node after the training set is filtered by the value of the test instance on the split feature. In this case, LazyDT ceases to split and returns the most frequent class in the training set covered by the current node, which degrades its classification performance. To overcome these two defects, we use the value of the test instance on the candidate feature to modify the Hellinger distance and the K-L divergence, respectively.

4.1 Hellinger distance based lazy decision trees (HLazyDT)

For a feature X, let v represent the value of the given test instance on X. For a two-class problem, we calculate the Hellinger distance of the value v on the feature X as follows.

$$ W_{H-v}(X_{+},X_{-})=\left| \sqrt{\frac{\vert X_{+v} \vert}{\vert X_{+} \vert}} - \sqrt{\frac{\vert X_{-v} \vert}{\vert X_{-}\vert}} \right| \quad , $$
(6)

where \(\vert X_{+} \vert\) indicates the number of examples belonging to the minority class in the training set and \(\vert X_{+v} \vert\) is the number of minority class examples with the value v for the feature X; \(\vert X_{-} \vert\) and \(\vert X_{-v} \vert\) are defined analogously for the majority class. The larger \(W_{H-v}\) is, the better the value v of the feature X discriminates the two classes; it reflects the contribution of the value v to the discriminative power of the feature X.

As mentioned above, we combine (2) with (6) and obtain the Hellinger distance based split criterion used in HLazyDT as follows.

$$ Split_{H}=\sqrt{W_{H-v}(X_{+},X_{-}) \cdot D_{H}(X_{+},X_{-})} \quad , $$
(7)

where \(D_{H}(X_{+},X_{-})\) and \(W_{H-v}(X_{+},X_{-})\) are calculated according to (2) and (6), respectively. In particular, when the value v of the test instance on the feature X does not appear among the values of the feature X in the training set, both (6) and (7) equal zero. In effect, (7) measures the ability of the feature X to discriminate the two classes based not only on all values of the feature X but also on the value of the test instance on that feature.
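A minimal sketch of (7), assuming per-value counts of a discrete feature and the index v of the test instance's value (the example counts are hypothetical):

```python
import math

def split_h(counts_pos, counts_neg, v):
    """Eq. (7): Hellinger distance based split score used by HLazyDT.

    counts_pos / counts_neg: per-value counts of the feature for the minority
    (+) and majority (-) class; v: index of the test instance's value.
    """
    n_pos, n_neg = sum(counts_pos), sum(counts_neg)
    d_h = math.sqrt(sum(                                  # Eq. (2)
        (math.sqrt(p / n_pos) - math.sqrt(q / n_neg)) ** 2
        for p, q in zip(counts_pos, counts_neg)))
    w_hv = abs(math.sqrt(counts_pos[v] / n_pos)           # Eq. (6)
               - math.sqrt(counts_neg[v] / n_neg))
    return math.sqrt(w_hv * d_h)

# Hypothetical feature with three values; the test instance takes the first.
print(split_h([8, 1, 1], [10, 45, 45], v=0))   # ~0.67
```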

Algorithm 1 outlines the approach for calculating the Hellinger distance and the procedure for inducing HLazyDT. In this algorithm, \(T_{y=i}\) indicates the subset of the training set T with class i, \(T_{f=j}\) specifies the subset with the value j for the feature f, and \(T_{f=j,y=i}\) identifies the subset with class i and value j for the feature f. For a two-class problem, the class i is drawn from a finite set of classes, namely the minority class (+) and the majority class (−).

Algorithm 1 Hellinger distance based lazy decision tree (HLazyDT)
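Since the published pseudocode is given as Algorithm 1, the following is only a compact, hypothetical rendering of the induction loop described above: it repeatedly scores the unused features with the criterion of (7), splits on the best one, and stops when the node is pure or no feature obtains a positive score. Tie-breaking and other details of the original algorithm are omitted, and every training instance is assumed to carry a value for every feature of the test instance.

```python
import math
from collections import Counter

def split_h(counts_pos, counts_neg, v):
    """Eq. (7), restated on dictionaries of value -> count so the sketch is self-contained."""
    n_pos, n_neg = sum(counts_pos.values()), sum(counts_neg.values())
    values = set(counts_pos) | set(counts_neg)
    d_h = math.sqrt(sum(
        (math.sqrt(counts_pos.get(j, 0) / n_pos)
         - math.sqrt(counts_neg.get(j, 0) / n_neg)) ** 2
        for j in values))                                  # Eq. (2)
    w_hv = abs(math.sqrt(counts_pos.get(v, 0) / n_pos)
               - math.sqrt(counts_neg.get(v, 0) / n_neg))  # Eq. (6)
    return math.sqrt(w_hv * d_h)

def hlazydt_classify(train, x, minority="+", majority="-"):
    """Trace one HLazyDT decision path for the test instance x.

    train: list of (features: dict, label) pairs with labels "+" or "-";
    x: dict mapping feature names to discrete values.
    """
    data = list(train)
    used = set()
    while True:
        labels = [y for _, y in data]
        if len(set(labels)) == 1:                  # pure node: classify
            return labels[0]
        best_f, best_score = None, 0.0
        for f, v in x.items():
            if f in used:
                continue
            cp = Counter(feats[f] for feats, y in data if y == minority)
            cn = Counter(feats[f] for feats, y in data if y == majority)
            score = split_h(cp, cn, v)
            if score > best_score:
                best_f, best_score = f, score
        if best_f is None:                         # no discriminative feature left
            return Counter(labels).most_common(1)[0][0]
        used.add(best_f)
        data = [(feats, y) for feats, y in data if feats[best_f] == x[best_f]]
```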

4.2 K-L divergence based lazy decision trees (KLLazyDT)

Similarly, we first calculate the K-L divergence of the value v as follows.

$$ W_{KL-v}(X_{+},X_{-})=\left( \frac{\vert X_{+v} \vert}{\vert X_{+} \vert} - \frac{\vert X_{-v} \vert}{\vert X_{-}\vert}\right) \cdot \ln\left( \frac{\frac{\vert X_{+v}\vert}{\vert X_{+}\vert}} {\frac{\vert X_{-v}\vert}{\vert X_{-}\vert}}\right) , $$
(8)

where all symbols have the same meaning as in (6); please refer to Section 4.1 for details. It reflects the discriminative power of the value v of the test instance on the feature X. Next, the K-L divergence based split criterion used in KLLazyDT is given as follows.

$$ Split_{KL}=\sqrt{W_{KL-v}(X_{+},X_{-}) \cdot D_{kl-symmetry}(X_{+},X_{-})} \quad , $$
(9)

where \(D_{kl-symmetry}(X_{+},X_{-})\) and \(W_{KL-v}(X_{+},X_{-})\) are calculated according to (4) and (8), respectively. Likewise, if the value v of the test instance on the feature X does not appear among the values of the feature X in the training set, both (8) and (9) equal zero. Moreover, it is worth mentioning that (5) is still needed to smooth (4) and (8) when zero-probability problems arise. The framework of the K-L divergence based lazy decision tree is described in Algorithm 2; please refer to Algorithm 1 for the explanations of the related symbols.
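A corresponding sketch of (9), applying the smoothing of (5) throughout for simplicity (the example counts are hypothetical):

```python
import math

def split_kl(counts_pos, counts_neg, v, lam=1):
    """Eq. (9): K-L divergence based split score used by KLLazyDT."""
    m = len(counts_pos)                               # number of feature values
    P = [(c + 1) / (sum(counts_pos) + lam * m) for c in counts_pos]   # Eq. (5)
    Q = [(c + 1) / (sum(counts_neg) + lam * m) for c in counts_neg]
    d_kl = sum((p - q) * math.log(p / q) for p, q in zip(P, Q))       # Eq. (4)
    w_klv = (P[v] - Q[v]) * math.log(P[v] / Q[v])                     # Eq. (8)
    return math.sqrt(w_klv * d_kl)

# Same hypothetical counts as in the HLazyDT example.
print(split_kl([8, 1, 1], [10, 45, 45], v=0))   # ~1.37
```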

Finally, we consider the time and memory complexity of eager and lazy decision trees, which has been discussed in [33]; to keep this paper self-contained, we restate the relevant results. For convenience, we consider binary splitting, a training dataset with n instances and f attributes, and a test dataset with t instances and f attributes, and we assume that a node is expanded only if it contains at least p training instances. (1) The time complexity of an eager decision tree model is \(O({\sum }_{i = 0}^{\lg \frac {n}{p}} 2^{i} \cdot f \cdot \frac {n}{2^{i}})= O(n \cdot f \cdot \lg \frac {n}{p})\) and the time complexity of a lazy decision tree model is \(O(t \cdot {\sum }_{i = 0}^{\lg \frac {n}{p}} f \cdot \frac {n}{2^{i}})= O(t \cdot f \cdot n) \) on average. Note that \( \lg \frac {n}{p} \) is the average depth of exploration, \(2^{i}\) is the average number of nodes at depth i, and \( \frac {n}{2^{i}} \) is the average number of training instances contained in a node at depth i. Thus, unless the number of unlabeled observations is small, lazy decision trees can be slower in prediction mode than the equivalent eager decision trees. (2) Building an eager decision tree requires memory for the stack operations used during the recursive construction of the tree as well as storage for the final trained model, while lazy decision trees only require the stack. The stack size and the model size of a decision tree are \(O({\sum }_{i = 0}^{\lg \frac {n}{p}} \frac {n}{2^{i}}) = O(n)\) and \( O(\frac {n}{p})\) (the number of non-leaf nodes in a decision tree) on average, respectively. Hence, an eager decision tree requires \( O(n) + O(\frac {n}{p}) \) memory on average in the training phase and \( O(\frac {n}{p}) \) in the testing phase, while a lazy decision tree requires only O(n). This means that an eager decision tree requires more memory in the training phase than a lazy decision tree, but less memory in the testing phase.
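As an illustrative calculation with hypothetical values \(n = 10^{4}\), \(p = 10\), \(f = 20\) and \(t = 10^{3}\):

$$ n \cdot f \cdot \lg\frac{n}{p} \approx 10^{4} \cdot 20 \cdot 10 = 2 \times 10^{6}, \qquad t \cdot f \cdot n = 10^{3} \cdot 20 \cdot 10^{4} = 2 \times 10^{8}, $$

so, for this configuration, the lazy model performs roughly two orders of magnitude more work in the prediction phase than the eager model spends building its tree.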

Algorithm 2 K-L divergence based lazy decision tree (KLLazyDT)

5 Experiments

The whole experiment consists of the following steps. Firstly, we categorize the experimental data sets into three groups: lowly (IR < 9) imbalanced data sets, highly (IR ≥ 9) imbalanced data sets and highly imbalanced data sets with SMOTE preprocessing. Secondly, we use equal-frequency binning to discretize the datasets used in the experiment. Finally, we use five-fold cross-validation to train and test the participating methods and then analyze the experimental results. The overall framework of the experiment is shown in Fig. 2, and the methods’ abbreviations used in Fig. 2 are listed in Table 1.

Fig. 2 The framework of the whole experiment

Table 1 All methods used in the experiment, their parameter settings and abbreviations

5.1 Experiment tool

In this section, we present the setup of the experimental framework used to develop the analysis of our proposals. We use KEEL (Knowledge Extraction based on Evolutionary Learning) to carry out the whole experiment; it empowers the user to assess the behavior of evolutionary learning and soft computing based techniques for all kinds of data mining problems: regression, classification, clustering, pattern mining and so on. KEEL is an open source Java framework (GPLv3 license) that provides a number of modules to perform a wide variety of data mining tasks. It includes tools for data management, adding new algorithms, designing multiple kinds of experiments and statistical analysis, and provides a simple GUI to design experiments with different data sets and computational intelligence algorithms in order to assess their behaviour. The latest version is KEEL 3.0, which includes new modules for semi-supervised learning, multi-instance learning, imbalanced classification and subgroup discovery [37]. In our experiment, we first implement the Hellinger distance based lazy decision tree and the K-L divergence based lazy decision tree in KEEL 3.0 and then use the imbalanced-classification module and the datasets provided by KEEL 3.0 to carry out the whole experiment.

5.2 Methods

In this experiment, we compare the Hellinger distance based lazy decision tree (HLazyDT) and the K-L divergence based lazy decision tree (KLLazyDT) with other methods, including the lazy decision tree (LazyDT), C4.5, the Hellinger distance based decision tree (HDDT) and the support vector machine (SVM), across lowly and highly imbalanced data sets. Among them, C4.5 is a popular eager decision tree algorithm that was used in the original LazyDT experiments [1]. Since imbalanced classification is the focus of this experiment, we use the unpruned version of C4.5, which has been shown to be an effective strategy [38]. Similarly, HDDT is an eager decision tree designed for imbalanced classification [24, 25]. Additionally, SVM is a popular approach for data classification and has been successfully applied in various applications [39]. Since SVM does not belong to the family of decision trees, we only use it as a point of comparison, and the statistical analysis of the experimental results does not include it. Finally, all methods used in this experiment, their parameter settings and their abbreviations are shown in Table 1; these abbreviations are used throughout the rest of this paper.

5.3 Datasets and data partitions

In this experiment, we select data sets from the KEEL data set repository and categorize them into two groups, lowly (IR < 9) and highly (IR ≥ 9) imbalanced data sets, according to the imbalance ratio (IR), i.e., the ratio of the number of majority class examples to the number of minority class examples. Although there is no consensus in the literature on when a data set is considered highly imbalanced, in this paper we regard an IR above 9 as highly imbalanced, because a classifier that ignores the minority class examples still attains an accuracy error of no more than 0.1. The characteristics of the data sets are summarized in Tables 2 and 3, where the number of examples, the number of minority class examples, the number of features, the number of continuous features, the number of discrete features and the imbalance ratio are denoted by “Examples”, “Minority”, “Features”, “Continuous”, “Discrete” and “IR” respectively. All of these data sets are two-class problems.

Table 2 Lowly imbalanced data sets used in the experiment
Table 3 Highly imbalanced data sets used in the experiment

On the other hand, when evaluating each of these approaches, we first use equal-frequency binning to discretize all data sets, since the family of lazy decision trees only deals with discrete data. Secondly, in order to explore the effect of SMOTE [10] on the methods used in this experiment, we also use SMOTE to preprocess the highly imbalanced data sets and compare the performance of the methods. Since our goal is to use SMOTE to obtain balanced data sets, the SMOTE level (SL) for the minority class is calculated as follows.

$$ SL=(\frac{n - n_{+}}{n_{+}} -1) \times 100\%. $$
(10)
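A one-line sketch of (10); the counts are hypothetical and serve only to check that the resulting SMOTE level balances the two classes:

```python
def smote_level(n, n_pos):
    """Eq. (10): oversampling level (in %) that balances the two classes."""
    return ((n - n_pos) / n_pos - 1) * 100

# Hypothetical data set with 1000 instances, 100 of them in the minority class.
sl = smote_level(1000, 100)     # 800%
# After SMOTE the minority class grows to n_pos * (1 + sl/100) = 900 instances,
# which equals the majority class size n - n_pos.
print(sl)
```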

Finally, we perform five-fold cross-validation. In this procedure, each data set is randomly partitioned into five equal-sized, disjoint subsets; a single subset is retained as the validation data for testing the model, and the remaining four subsets are used as the training data. The process is repeated five times, and the mean over the five repetitions is taken as the final result.

5.4 Evaluation measure

Since different evaluation measures can lead to different conclusions, it is advisable to use multiple evaluation measures, especially when learning from imbalanced datasets [40]. Therefore, we employ the area under the receiver operating characteristic (ROC) curve and the G-mean, two popular evaluation measures for imbalanced problems, to compare the performance of all methods used in this experiment; a small computational sketch of both measures follows their definitions below.

(1) The area under the ROC curve (AUC) can be calculated as follows [41]:

$$ AUC=\frac{2S_{0}-n_{+}(n_{+}+ 1)}{2n_{+}n_{-}}, $$
(11)

where \(n_{+}\) and \(n_{-}\) denote the numbers of minority class and majority class instances respectively, and \(S_{0}\) is the sum of the ranks of the minority class instances.

(2) The G-mean is computed as the geometric mean of the true positive and true negative rates, i.e.,

$$ G\text{-}mean=\sqrt{\frac{TP}{TP+FN} \cdot \frac{TN}{TN+FP}}, $$
(12)

where TP denotes the number of true positive instances, TN the number of true negative instances, FP the number of false positive instances and FN the number of false negative instances.
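As referenced above, the following sketch evaluates both measures; the rank-based AUC of (11) is computed with average ranks for ties (the scores and confusion-matrix counts are hypothetical, and scipy is assumed to be available):

```python
import math
from scipy.stats import rankdata

def auc_rank(scores_pos, scores_neg):
    """Eq. (11): AUC from the rank sum of the minority class scores."""
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    ranks = rankdata(list(scores_pos) + list(scores_neg))  # average ranks for ties
    s0 = ranks[:n_pos].sum()
    return (2 * s0 - n_pos * (n_pos + 1)) / (2 * n_pos * n_neg)

def g_mean(tp, fn, tn, fp):
    """Eq. (12): geometric mean of the true positive and true negative rates."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

print(auc_rank([0.9, 0.6, 0.8], [0.7, 0.2, 0.4, 0.1]))  # ~0.917
print(g_mean(tp=8, fn=2, tn=85, fp=5))                  # ~0.87
```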

5.5 Statistical tests for performance comparison

To analyze the experimental results, non-parametric tests are used for the statistical comparison of all methods, because the initial conditions that guarantee the reliability of parametric tests may not be satisfied [42, 43]. In this experiment, we use the adjusted p-value Holm procedure to find which methods differ significantly in a 1 × n comparison. In this procedure, we first obtain the average ranks of the methods according to the Friedman procedure. In the case of a tie, the average rank is assigned to each tied method; for instance, if two methods tie for first, each is assigned a rank of 1.5, and if three tie for first, each is assigned a rank of 2, and so on. Secondly, we need to choose a base (control) method. In our experiment, we choose the method with the minimum average rank as the base method and arrange the others in ascending order of their average ranks. The test statistic for comparing the ith method with the base method is as follows.

$$ Z=\frac{R_{i} - R_{b}}{\sqrt{\frac{K(K + 1)}{6N}}}, $$
(13)

where \(R_{i}\) and \(R_{b}\) are the average ranks computed through the Friedman test for the ith method and the base method respectively, K is the number of methods to be compared and N is the number of data sets used in the comparison. The Z value is used to find the corresponding probability (P-value) from the normal distribution table, which is then compared with an appropriate significance level α. The adjusted P-value Holm procedure starts with the most significant P-value; the ith adjusted P-value (\(APV_{i}\)) is defined as follows.

$$ APV_{i}=min\{v;1\}, where~v=max\{(k-j)P_{j},1\leq j \leq i\}. $$
(14)

If \(APV_{i} < \alpha\), the null hypothesis, namely that there is no significant difference between the ranks of the two methods, is rejected. As soon as a null hypothesis cannot be rejected, all remaining hypotheses are retained as well.
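A compact sketch of (13) and (14) on hypothetical average ranks; the non-control methods are processed in ascending order of their P-values, i.e., starting with the most significant one, as in Holm's step-down procedure:

```python
import math

def holm_apv(avg_ranks, base, n_datasets):
    """Eqs. (13)-(14): Holm adjusted P-values for a 1 x n comparison with a control."""
    k = len(avg_ranks)                                   # number of compared methods
    se = math.sqrt(k * (k + 1) / (6 * n_datasets))
    # Z statistic (Eq. (13)) and two-sided normal P-value for each non-control method.
    p_values = {m: math.erfc(abs((r - avg_ranks[base]) / se) / math.sqrt(2))
                for m, r in avg_ranks.items() if m != base}
    ordered = sorted(p_values, key=p_values.get)         # most significant first
    apvs, running_max = {}, 0.0
    for j, m in enumerate(ordered, start=1):
        running_max = max(running_max, (k - j) * p_values[m])   # Eq. (14)
        apvs[m] = min(running_max, 1.0)
    return apvs

# Hypothetical average Friedman ranks of five decision trees over 20 data sets.
ranks = {"HLazyDT": 2.1, "KLLazyDT": 2.3, "HDDT": 2.9, "C4.5": 3.6, "LazyDT": 4.1}
print(holm_apv(ranks, base="HLazyDT", n_datasets=20))
```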

5.6 Experiment results and analysis

In this section, we present the empirical analysis of our methods, HLazyDT and KLLazyDT, in order to determine their robustness in three different scenarios: lowly imbalanced data sets, highly imbalanced data sets and highly imbalanced data sets with SMOTE preprocessing. We first compare the average AUC and G-mean values of the different approaches, and then compare their average ranks at the 95% confidence level according to AUC and G-mean respectively. If there is a statistically significant difference between the base approach and another method, the corresponding method is marked in Table 10.

The experimental results are shown in Tables 4, 5, 6, 7, 8 and 9. Tables 4, 6 and 8 show the AUC results for each method across the three types of datasets used in this experiment, and the remaining tables show the G-mean results. To compare the average AUC and G-mean results clearly, we also provide the charts shown in Figs. 3 and 4 respectively. The observations can be summarized as follows.

(1) Across the lowly imbalanced data sets, HLazyDT and KLLazyDT outperform the other trees (excluding SVM) according to AUC, which can also be seen in Fig. 3. As shown in Table 10, HLazyDT and KLLazyDT perform statistically significantly better than LazyDT at the 95% confidence level according to the average ranks of AUC. Similarly, focusing on the G-mean results, we obtain the same conclusion, which can be seen from Table 5 and Fig. 4.

(2) Across the highly imbalanced data sets, Tables 6 and 7 show that HLazyDT and KLLazyDT outperform the other trees (excluding SVM) according to both AUC and G-mean, which can also be seen from Figs. 3 and 4. Moreover, Table 10 shows that HLazyDT and KLLazyDT perform statistically significantly better than the other trees at the 95% confidence level according to the average ranks of both AUC and G-mean.

(3) When SMOTE is used to preprocess the highly imbalanced data sets, the performance of all methods improves. As shown in Table 10, HDDT performs statistically significantly better than C4.5, LazyDT and KLLazyDT at the 95% confidence level according to the average ranks of both AUC and G-mean. In addition, Tables 8 and 9 show that HLazyDT and KLLazyDT outperform C4.5 and LazyDT, though not SVM and HDDT. Nevertheless, HLazyDT and KLLazyDT still achieve performance comparable to HDDT (neither significantly worse) according to the average AUC and G-mean values, which can be clearly observed from Figs. 3 and 4.

Table 4 The AUC results across lowly imbalanced data sets
Table 5 The G-means results across lowly imbalanced data sets
Table 6 The AUC results across highly imbalanced data sets
Table 7 The G-means results across highly imbalanced data sets
Table 8 The AUC results across highly imbalanced data sets with SMOTE
Table 9 The G-means results across highly imbalanced data sets with SMOTE
Table 10 Average Friedman rankings and APVs of various decision trees using Holm’s procedure in ROC and G-means
Fig. 3 Comparison of the average ROC results for each method across all data sets

Fig. 4 Comparison of the average G-means results for each method across all data sets

Finally, we provide the following analyses and conclusions.

(1) It is obvious from Tables 4, 5, 6, 7, 8 and 9 that the AUC and G-mean results are not determined by the class imbalance ratio alone. As it turns out, data set complexity, such as class overlapping, lack of representative data and small disjuncts, is the primary determining factor of classification deterioration [7, 8].

(2) SMOTE preprocessing is very helpful for improving the performance of all methods used in this experiment when learning from highly imbalanced data sets. The main reason is that SMOTE creates artificial data based on feature-space similarities between existing minority class instances, which broadens the decision region of a classifier and improves generalization [10].

(3) The information gain ratio used by C4.5 prefers features with fewer values and is skew-sensitive, so C4.5 cannot always select the best split feature. LazyDT uses the information gain to choose the split feature, which is also skew-sensitive. Moreover, LazyDT does not consider the value of the test instance on the candidate feature, and this strategy may cause it to backtrack to the parent node and return the most frequent class, which degrades the classification performance to some extent. In contrast, HDDT, HLazyDT and KLLazyDT all use skew-insensitive split criteria and can produce more branches to capture finer-grained differences in the datasets; therefore, they distinguish between the minority and majority classes better than LazyDT and C4.5. We also note that HLazyDT and KLLazyDT significantly outperform HDDT across the highly imbalanced datasets. Since the minority class instances in the highly imbalanced datasets are very rare, the values of minority class test instances on some features may not appear in the training set, which may cause C4.5, LazyDT and HDDT to backtrack to the parent node and return the most frequent class, again degrading the classification performance. Furthermore, the minority class instances in the highly imbalanced datasets are abundant in small disjuncts. Eager decision trees such as C4.5 and HDDT tend to miss small disjuncts because they ignore instances that occur infrequently, such as instances in very small disjuncts, while lazy decision trees do not ignore any instances. This is why the performance of LazyDT is superior to that of C4.5 according to both AUC and G-mean. Consequently, lazy learning methods are recommended when small disjuncts abound [44].

(4) In our experiment, the performance of SVM is superior to that of all the decision trees according to both evaluation measures, AUC and G-mean. This result is consistent with the conclusions in [45, 46] and indicates that SVM achieves better generalization than decision trees on nonlinear problems, while decision trees can easily suffer from overfitting.

6 Conclusion

In this paper, we use the Hellinger distance and the K-L divergence as split criteria to modify the lazy decision tree, yielding HLazyDT and KLLazyDT, in order to deal with imbalanced classification. An experimental framework using five-fold cross-validation is applied across a wide range of imbalanced data sets to investigate the effectiveness of our methods compared with the lazy decision tree, C4.5, the Hellinger distance based decision tree and the support vector machine, according to the AUC and G-mean evaluation measures. The adjusted p-value Holm post-hoc procedure of the Friedman test is used to determine the significance of the ranks across multiple data sets. Based on the experimental results, we demonstrate that using the Hellinger distance or the K-L divergence as the split criterion improves the performance of LazyDT on imbalanced classification, and that HLazyDT and KLLazyDT outperform the other decision trees on the imbalanced datasets used in our experiment. In particular, when SMOTE is used to preprocess the highly imbalanced datasets, HLazyDT and KLLazyDT still achieve performance comparable to HDDT. On the other hand, the performance of SVM is superior to that of all the decision trees, which suggests that SVM achieves better generalization, while decision trees can easily suffer from overfitting. Finally, we recommend HLazyDT and KLLazyDT for imbalanced learning when the choice of approach is limited to decision trees, and we consider them a good choice for imbalanced data stream learning in areas such as medical data mining and privacy preservation. In future work, we wish to extend our methods to deal with continuous features and multi-class imbalanced problems. Additionally, ensemble methods have been proven to help decision trees solve imbalanced learning effectively [47], and a new ensemble strategy based on optimizing a decision directed acyclic graph has been proposed for multi-class classification problems [48], which inspires us to use ensemble methods to further improve the ability of HLazyDT and KLLazyDT to tackle imbalanced learning.