1 Introduction

A data set is imbalanced if the examples of one class outnumber those of the others. In practice, the imbalance issue is encountered by numerous real-world machine learning applications, such as text classification [13, 80, 81], speech recognition [46], software defect prediction [32, 34, 35, 61], and bioinformatics and biomedical decision making [48, 76].

The skewed class distribution of imbalanced data hinders the performance of most standard classification algorithms that work well on data sets with even class distributions, such as Naive Bayes, IB1, C4.5 [30, 74], Logistic Regression [70], Neural Networks and Support Vector Machines [30]. The imbalanced data problem has therefore attracted much attention at authoritative machine learning and data mining workshops such as AAAI'2000, ICML'2003 and ACM SIGKDD'2004, and a number of methods for dealing with imbalanced data have been proposed at the data level and the algorithm level.

The data-level solutions concentrate on two points: 1) rebalancing the imbalanced class distribution by sampling or creating new examples, such as random undersampling (RUS) [41], random oversampling (ROS) [42], and the synthetic minority over-sampling technique (SMOTE) [8]. It is worth noting that these solutions still have some issues: important information may be discarded by undersampling, while the duplications and uncertainties introduced by oversampling may lead to overfitting and new problems; 2) feature selection [11, 34, 47, 69, 71, 72, 81] is used to avoid overfitting on high-dimensional imbalanced data sets. However, feature selection merely reduces the dimensionality by removing unimportant and useless features rather than enhancing the discriminant ability of the features themselves.

The algorithm-level solutions primarily focus on exploring suitable and robust classification algorithms for imbalanced learning problems, including one-class learning [64, 70, 75], ensemble learning methods [6, 33, 68], and cost-sensitive analysis [14]. These methods also have issues: one-class learning fails to build a classifier when the features lack discriminant ability; ensemble learning methods consume much computation time; and cost-sensitive analysis requires an appropriate cost matrix to be determined beforehand.

In an imbalanced problem, the most obvious characteristic is the skewed class distribution. Nevertheless, theoretical and experimental studies presented in [3, 30, 73] indicate that the skewed data distribution is not the only factor that hampers a traditional classification algorithm in identifying rare events. Small sample size, high dimensionality and problem complexity hinder learning performance as well, because it is difficult to build a good classification model over a large number of features with limited samples. Essentially, a classification model is built on the relationship between features and classes. When the imbalanced data is high-dimensional and small in size, the features lack discriminant ability for the classes, which leads to degraded classification performance, especially on the minority class. Nevertheless, the existing solutions pay more attention to rebalancing the skewed class distribution or adapting algorithms than to improving the discriminant ability of the features in imbalanced data sets.

In order to build a better classification model for imbalanced learning problems, it is necessary to construct new features with high discriminant ability in place of the original features. Fortunately, Pekalska and Duin [17, 63] have presented a dissimilarity-based representation method which can improve the discriminant ability of features via pairwise dissimilarities between examples, because it not only captures the statistical information but also preserves the structural information of data sets. This dissimilarity-based representation was originally proposed to depict the characteristics of unstructured or incomplete data sets; recently, it has been successfully applied to describe structured data sets and has supported good classifiers [58, 59, 62, 63], especially in pattern recognition. Inspired by these works, we believe that dissimilarity-based representation can be employed for handling imbalanced learning problems as well, although there has been no research on applying this method to them.

Based on the dissimilarity-based representation, the original data set is projected into the dissimilarity space, in which general classifiers are trained, i.e. the dissimilarity-based classification algorithm (DBC) [59]. The DBC primarily consists of three parts: prototype selection, dissimilarity transformation and classification. Taking into account that redundant and useless features (such as noisy or irrelevant features) in an imbalanced data set might affect the quality and efficiency of the prototype selection and dissimilarity transformation of DBC, we propose an expanded dissimilarity-based classification method (EDBC) for solving imbalanced learning problems, in which feature selection is carried out beforehand to filter unimportant features out of the original data sets. In our experimental study, three feature selection methods, three prototype selection methods and two distance measures are employed for accomplishing dissimilarity transformation on 24 imbalanced data sets; seven state-of-the-art solutions (RUS, ROS, SMOTE, Bagging, Boosting, MetaCost and EM1vs1) are compared with our proposed method EDBC in terms of AUC, as suggested in Refs. [25, 26], under five standard classification algorithms (Naive Bayes, Random Forest, IB1, Multilayer Perceptron and Logistic Regression). The experimental results show that EDBC greatly improves the performance of classifiers on imbalanced data sets and stands out in comparison with the other solutions.

The rest of this paper is organized as follows: Section 2 reviews previous research related to the conventional DBC method. Section 3 presents the expanded dissimilarity-based imbalance classification method. Section 4 reports the experimental procedure and analyzes the results. Finally, Section 5 summarizes the study and draws conclusions.

2 Previous work

Over the past decades, the issues of the imbalanced learning problem have been discussed and reviewed [2, 7, 9, 22, 25, 30, 40]. The proposed methods can be grouped into two categories: data-level and algorithm-level.

The data-level methods focus on adjusting the original imbalanced class distribution via sampling or the generation of new examples, or on reducing the dimensionality of high-dimensional imbalanced data sets by feature selection. The more popular solutions explore various types of sampling to alter the skewed class distribution. Random under-sampling [41] randomly deletes some examples of the majority class, but in doing so also removes information important for subsequent classification. Random over-sampling [42] replicates minority class examples; since it makes exact copies, the duplication leads to overfitting. To overcome this problem, the synthetic minority over-sampling technique (SMOTE) [8] over-samples the minority class by creating new synthetic minority examples, as sketched below. For high-dimensional imbalanced data sets, feature selection [11, 34, 71, 72, 81] has been confirmed to be more important than the choice of classification algorithm for avoiding overfitting.
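To make the SMOTE idea concrete, the following is a minimal sketch rather than the reference implementation of [8]: each synthetic example is an interpolation between a minority example and one of its k nearest minority neighbors. The function name and parameters are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Create n_new synthetic minority examples by interpolating sampled
    minority examples with randomly chosen ones of their k nearest minority
    neighbors (a sketch of the SMOTE idea, not the reference implementation)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                          # column 0 is the point itself
    base = rng.integers(len(X_min), size=n_new)            # minority examples to expand
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # one of their k neighbors
    gap = rng.random((n_new, 1))                           # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```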

The algorithm-level methods concentrate on modifying existing classification algorithms to suit imbalanced learning problems, including cost-sensitive learning, recognition-based learning and ensemble learning. 1) Cost-sensitive learning [45] aims to minimize the total misclassification cost by adjusting the misclassification cost for each class. MetaCost [14] is a classical algorithm of this kind, which makes an arbitrary classification algorithm cost-sensitive by wrapping a cost-minimizing procedure around it. 2) Recognition-based learning [64], Ripper [70, 75] and auto-association [29] build the discrimination model on the examples of the target class alone. They have proved particularly useful on extremely unbalanced data sets composed of a high-dimensional noisy feature space. 3) Ensemble learning is often employed to reduce variance and bias by aggregating the results of many classification algorithms on the imbalanced data. Representatively, Bagging [6] produces an aggregated predictor that strengthens a base classification algorithm by generating multiple versions of it. Boosting [40] identifies accurate weights for all training examples by iteratively adjusting them according to the classification results. Chawla et al. [10] proposed combining SMOTE with Boosting in order to balance the skewed class distribution and then learn better and broader decision regions for the minority class. Additionally, a novel coding-based multiclass algorithm [68] was presented to convert the imbalanced binary-class problem into a balanced multi-class problem and then build a binary classification algorithm on each pair of classes with the one-against-one coding scheme.

From the solutions mentioned above, we know that existing work pays more attention to adjusting the skewed class distribution or exploring new algorithms than to improving the discriminant ability of features for imbalanced learning. In fact, if the original features lack discriminant ability, carrying out feature selection alone is not powerful enough. Alternatively, in place of the original features, Pekalska et al. [63] proposed that the dissimilarities to a subset of the examples in the historical data are more effective for representing a data set, because they not only preserve the statistical information but also capture the geometry and structure of the data. Duin et al. [17] first introduced the relational discriminant analysis method based on a proximity description of data. After that, Duin and Pekalska [52, 57, 58, 62, 63] demonstrated that dissimilarity-based representation can improve classification performance and provided the framework of the DBC method, consisting of prototype selection, dissimilarity transformation and classification. To optimize the DBC method, Kim et al. [37-39] proposed using prototype reduction schemes to improve the quality of prototype selection. In recent years, the DBC method has been recognized and widely applied in various real-world fields, such as detecting seismic signals [49], face recognition [50] and medical image computing [67]. However, there has been no research on applying this method to the imbalanced learning problem. To build a good classifier, we employ the dissimilarity-based method to classify imbalanced data sets.

3 Dissimilarity-based imbalance data classification method

In this section, we first provide an overview of the proposed method in Section 3.1, and then describe each step of the proposed expanded algorithm in detail: feature selection in Section 3.2, prototype selection in Section 3.3 and dissimilarity transformation in Section 3.4. Section 3.5 analyzes the complexity of the proposed method.

3.1 Overview of the proposed method

In the traditional approach to imbalanced learning, classifiers are built in the original feature space. An alternative is to construct classification models on dissimilarity representations, in which each example is described by its pairwise dissimilarity relations to a set of representative examples. This approach becomes especially useful when the original data is described by many features or when experts cannot formulate the attributes explicitly but are able to provide a dissimilarity measure; such a measure can be considered a connection between perception and higher-level knowledge, a crucial factor in the process of human recognition and categorization [18, 21].

The original dissimilarity-based classification algorithm (DBC) consists of prototype selection, dissimilarity transformation and classification. When this method is used to classify imbalanced data sets, redundant and irrelevant features may compromise the performance of the prototype selection methods and even disturb the dissimilarity transformation, finally resulting in poor classification performance. To alleviate the disturbance of useless features on the prototype selection and dissimilarity transformation of the dissimilarity-based classification algorithm, we extend the original DBC method with feature selection, yielding the expanded dissimilarity-based classification algorithm (EDBC), which consists of two parts: Model Construction and Classification. Figure 1 shows the details.

Fig. 1 The framework of the proposed method

  1. Model Construction

    For a given historical imbalanced data set, a feature subset important to the target concept is first selected (Feature Selection), and the data is then reduced by retaining only the selected features (Data Reduction). Next, the representative examples are selected from the reduced data for each class to obtain a prototype set (Prototype Selection), and the reduced data is projected into the dissimilarity space by computing the dissimilarity between the examples of the reduced data and the prototype set (Dissimilarity Transformation). Finally, the classification model is constructed by a specific classification algorithm on the dissimilarity-based imbalanced data set.

  2. Classification

    For a new imbalanced data set, using the feature subset and the prototype set selected in Model Construction, its dimensionality is reduced and the corresponding dissimilarity-based data set is created by computing the dissimilarities between the new examples and the prototype set. Finally, the classification model built in Model Construction is employed to classify the dissimilarity-based imbalanced data set (a runnable sketch of this workflow follows).
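To make the two-part workflow above concrete, here is a minimal runnable sketch of the pipeline in Python with scikit-learn. Mutual-information ranking and random per-class sampling stand in for the concrete feature and prototype selection methods of Sections 3.2 and 3.3, and the data set is synthetic; this is an illustration of the structure, not the authors' exact procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import pairwise_distances, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=30, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Feature Selection + Data Reduction (mutual information stands in for CFS/FCBFS/FAST)
selector = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)
X_red = selector.transform(X_tr)

# 2) Prototype Selection (random per-class sampling stands in for RC/KC/JPC)
prototypes = np.vstack([
    X_red[rng.choice(np.flatnonzero(y_tr == c), size=5)] for c in np.unique(y_tr)
])

# 3) Dissimilarity Transformation: each example becomes its distances to the prototypes
D_tr = pairwise_distances(X_red, prototypes, metric="euclidean")

# 4) Classification in the dissimilarity space
clf = GaussianNB().fit(D_tr, y_tr)

# Classification of new data reuses the same feature subset and prototype set
D_te = pairwise_distances(selector.transform(X_te), prototypes, metric="euclidean")
print("AUC:", roc_auc_score(y_te, clf.predict_proba(D_te)[:, 1]))
```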

3.2 Feature selection

When learning imbalanced problems, redundant and useless features in the imbalanced data sets may hinder the generalization ability of the given learning algorithm [79]. It is inadequate to handle high-dimensional imbalanced data sets only by employing sampling techniques and adapted algorithms. Putten and Someren [69] carried out experiments on the high-dimensional imbalanced data sets of the CoIL Challenge 2000 project and concluded that feature selection is much more important than the selection of the classification algorithm, as it contributes to avoiding overfitting.

In the preparation stage, the purpose of employing feature selection [23] is to filter out irrelevant and redundant features, in order to alleviate or even avoid the curse of dimensionality, reduce storage and memory requirements, increase mining accuracy, and enhance the comprehensibility of the classification results. According to the evaluation process, feature selection strategies can be divided into three types: filter, wrapper and embedded. The wrapper and embedded methods evaluate a feature subset based on the performance of a specific classification algorithm. Although they can obtain a valid feature subset supporting that algorithm, they lack generality and efficiency due to their high complexity and strong dependence on the classification algorithm. In contrast, filter methods are independent of the classification algorithm; they obtain the feature subset by scoring the correlation, the \(\chi^{2}\) statistic, the information gain or the symmetrical uncertainty between features, with lower complexity and a reduced possibility of overfitting.

Since the aim here is to offset the pernicious effects of useless features on the subsequent prototype selection and dissimilarity transformation, rather than to improve the classification algorithm's performance directly, filter-based feature selection methods are chosen to extend the original DBC method. For class imbalance learning problems, some popular filter-based metrics have been systematically studied and compared [11, 19, 34, 71, 72, 77, 81]. In this paper, we adopt three classical feature selection methods to improve the dissimilarity-based classification algorithm: correlation-based feature selection (CFS) [24], FCBFS [78] and FAST [66].

CFS [24] evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred.

FCBFS [78] is a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis; it uses the symmetrical uncertainty attribute evaluator to measure the correlation between features (a sketch of this measure follows the method descriptions).

FAST [66] is a clustering-based feature selection method with high efficiency and effectiveness. It first adopts an efficient minimum-spanning-tree clustering method to divide the features into clusters, and then selects the most representative features, those most strongly related to the target concept, as the feature subset. The clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features.
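As an illustration of the filter style shared by these methods, the sketch below computes the symmetrical uncertainty score that FCBFS uses to rank features. It assumes the features are already discretized, and it omits FCBFS's second pass that removes redundancy among the relevant features.

```python
import numpy as np

def entropy(x):
    """Shannon entropy of a discrete variable (in bits)."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    joint = entropy([f"{a}|{b}" for a, b in zip(x, y)])  # joint entropy H(X, Y)
    ig = hx + hy - joint                                 # information gain = mutual information
    return 2.0 * ig / (hx + hy) if hx + hy > 0 else 0.0

def rank_features(X, y):
    """Rank discretized features by relevance to the class, as FCBFS's first pass does."""
    scores = [symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1], scores
```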

3.3 Prototype selection

In the DBC method, prototypes, the representative examples characterizing the typicality of a class, are selected as the references for the dissimilarity transformation. A number of methods have been investigated to obtain such a prototype set, such as the random selection methods RandomC [60] and KCentres [16], the mode-seeking methods ModeSeek [12] and LinProg [5], and the feature selection methods FeaSel [27], KCentres-LP and EdiCon [57].

Of these methods, random selection [51, 52, 56, 60] was proposed first and is the simplest method for prototype selection, because it just samples randomly with replacement for a specified number of times (the number of prototypes to be extracted). Furthermore, it has been validated to work well whether it is implemented globally or separately for each class. Pekalska et al. [57] carried out extensive experimental comparisons of the available prototype selection methods in the dissimilarity-based classification algorithm; they suggested that a systematic prototype selection method may be better than random selection and found that KCentres performs well in general.

Apart from random selection and KCentres, the clustering-based Jarvis-Patrick clustering (JPC) algorithm [31, 53] is a good choice for selecting prototypes from the imbalanced data set in the proposed EDBC algorithm. JPC is good at dealing with noise, outliers, and clusters of different sizes, shapes and densities, and it is particularly good at finding tight clusters of strongly related examples in high-dimensional data. In the JPC algorithm, the shared nearest neighbor (SNN) similarity is used to measure the proximity between examples instead of a direct similarity such as a distance measure or a correlation coefficient. The SNN similarity is useful because it addresses some problems that occur with direct similarity. At the same time, the SNN similarity takes the context of an example into account via the number of shared nearest neighbors, so it can handle the situation in which an example happens to be relatively close to another example that nevertheless belongs to a different class; a sketch of this computation is given after Algorithm 1 below.

Algorithm 1 Computation of the SNN similarity (rendered as a figure)
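Since Algorithm 1 is only shown as a figure, the following is a minimal sketch of one common reading of the SNN similarity: the similarity of two examples is the number of neighbors their k-nearest-neighbor lists share. Details such as the choice of k, or requiring the two examples to appear in each other's neighbor lists, may differ from the authors' Algorithm 1.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=5):
    """Shared-nearest-neighbor similarity: the size of the overlap of the
    k-nearest-neighbor lists of each pair of examples."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)            # idx[i] includes i itself in column 0
    neigh = [set(row[1:]) for row in idx]
    n = len(X)
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = len(neigh[i] & neigh[j])
    return S
```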

The prototypes differ with different prototype selection methods. Assume the imbalanced data set contains the classes \(\omega_{1}, \omega_{2}, \ldots, \omega_{c}\). Let \(D_{t}\) be the training set and let \(D_{i}\) be the subset of training examples belonging to class \(\omega_{i}\). Each method selects \(R\) examples in total for the prototype set \(P\); the algorithms are all applied to each class separately, with \(r_{i}\) examples chosen per class such that \(R=\sum r_{i}\). The detailed prototype selection procedures are described as follows:

  1. Random Selection (RC). Randomly select \(r_{i}\) examples with replacement from the training data \(D_{i}\) of class \(\omega_{i}\) to obtain the representation set \(R_{i}\); the final representation set is \(R=\sum R_{i}\).

  2. KCentres (KC). This algorithm is applied to each class \(\omega_{i}\) and tries to choose \(r_{i}\) examples from it. The algorithm proceeds as below (a code sketch of this procedure follows the list):

    (1) Randomly select an initial set \(R_{i}=\{{p_{1}^{i}}, {p_{2}^{i}}, \ldots, p_{r_{i}}^{i}\}\) of \(r_{i}\) examples from the \(i\)th training data set \(D_{i}\);

    (2) For each \(x \in D_{i}\), find its nearest neighbor in \(R_{i}\). Let \(J_{j}\), \(j=1,2,\ldots,r_{i}\), be the subset of \(D_{i}\) consisting of the examples whose nearest neighbor in \(R_{i}\) is \({p_{j}^{i}}\);

    (3) For each \(J_{j}\), find its center \(c_{j}\), i.e. the example for which the maximum distance to all other examples in \(J_{j}\) is minimal (this maximum distance is the radius of \(J_{j}\));

    (4) For each center \(c_{j}\), if \(c_{j}\neq {p_{j}^{i}}\), replace \({p_{j}^{i}}\) by \(c_{j}\) in \(R_{i}\). If any replacement is done, return to step (2); otherwise stop the iteration. The final representation set \(R\) consists of all sets \(R_{i}\).

  3. Jarvis-Patrick clustering (JPC). For each class \(\omega_{i}\), it works as follows:

    (1) Compute the SNN similarity between each two points of \(\omega_{i}\) with Algorithm 1 and connect the pairs of examples with nonzero SNN similarity, yielding the SNN similarity graph;

    (2) Sparsify the SNN similarity graph by cutting the links between examples whose SNN similarity is smaller than a given threshold;

    (3) Find the connected components (clusters) of the sparsified SNN similarity graph;

    (4) Select the centers of the clusters to create the prototype set \(R_{i}\) for class \(\omega_{i}\). The final prototype set \(R\) consists of all sets \(R_{i}\).
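The following is a minimal sketch of the KCentres procedure in steps (1)-(4) above for a single class, using Euclidean distances; empty clusters are simply kept at their old prototype rather than handled specially.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def kcentres(D_i, r, rng=np.random.default_rng(0), max_iter=100):
    """KCentres on the examples D_i of one class: iteratively replace each
    prototype by the minimax centre of the examples it currently owns."""
    dist = pairwise_distances(D_i)                        # Euclidean by default
    proto = rng.choice(len(D_i), size=r, replace=False)   # (1) random initial set
    for _ in range(max_iter):
        owner = proto[np.argmin(dist[:, proto], axis=1)]  # (2) nearest prototype of each x
        new_proto = []
        for p in proto:
            members = np.flatnonzero(owner == p)
            if members.size == 0:                         # empty cluster: keep old prototype
                new_proto.append(p)
                continue
            sub = dist[np.ix_(members, members)]
            new_proto.append(members[np.argmin(sub.max(axis=1))])  # (3) minimax centre
        new_proto = np.array(new_proto)
        if np.array_equal(new_proto, proto):              # (4) stop when no replacement
            break
        proto = new_proto
    return D_i[proto]
```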

In the clustering processes of KC and JPC, the Euclidean distance [54] is used by default to measure the proximity between examples. In practice, there are many other proximity measures, such as the Jaccard coefficient, cosine similarity, the extended Jaccard coefficient, Dynamic Time Warping (DTW) [4, 65] and Optimal Subsequence Bijection (OSB) [43]. Note that the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as the Euclidean distance are often used. For sparse, asymmetric data, it is more suitable to employ the cosine, Jaccard or extended Jaccard measures. DTW and OSB are appropriate for measuring the proximity of examples in time series. Besides, for searching similar web pages, Loia et al. [55] proposed a proximity fuzzy C-means (P-FCM) that incorporates a measure of similarity or dissimilarity as the user's feedback on the clusters, in which the Euclidean distance is used within the standard FCM algorithm.

Furthermore, Liaw and Wiener [44] proposed that the \((i,j)\) element of the proximity matrix produced by Random Forest can be used to represent the similarity between examples \(i\) and \(j\), regarding examples in the same terminal nodes as similar observations. However, some examples may be misclassified into wrong terminal nodes when building the Random Forest, and it is difficult to apply this method to each class separately.

To select the representative examples for each class, a simple and valid proximity measure is preferred. In this paper, the EDBC algorithm is proposed for structured binary imbalanced data sets, which often consist of dense and continuous attributes, so the Euclidean distance is employed to measure the proximity between examples in the clustering-based prototype selection methods KC and JPC.

3.4 Dissimilarity transformation

In the dissimilarity transformation shown in Fig. 2, the reduced data is projected into the dissimilarity space via the pairwise dissimilarity relations between the examples of the reduced data and the prototype set, where a metric distance measure is usually employed to represent the dissimilarity between examples.

Fig. 2 The procedure of dissimilarity transformation

Assume that \(D=\{x_{1},x_{2},\ldots,x_{n}\}\) is the reduced data consisting of \(n\) examples, where \(x_{i}=\{a_{i1},a_{i2},\ldots,a_{im},c_{i}\}\) is the \(i\)th example with \(m+1\) attributes (\(m\) independent attributes and the class attribute), and that \(P=\{p_{1},p_{2},\ldots,p_{r}\}\) denotes the prototype set of \(r\) representative examples, where \(p_{i} = \{a^{\prime}_{i1}, a^{\prime}_{i2},\ldots, a^{\prime}_{im}, c^{\prime}_{i}\}\) is the \(i\)th prototype selected from the reduced data \(D\). Then the dissimilarity-based data \(\hat{D} = \{\hat{x}_{1}, \hat{x}_{2}, \ldots, \hat{x}_{n}\}\) can be obtained as follows:

  1. \(\hat{D}\) consists of \(n\) examples and each example \(\hat{x}_{i} = \{\hat{a}_{i1}, \hat{a}_{i2},\ldots, \hat{a}_{ir}, c_{i}\}\) is described with \(r\) attributes;

  2. For the example \(\hat{x}_{i}\), \(\hat{a}_{ij}\) is the dissimilarity between the example \(x_{i}\) and the prototype \(p_{j}\), computed by

    $$\hat{a}_{ij} = dis(x_{i},p_{j}) = \left(\sum\limits_{k=1}^{m}|a_{ik}-a^{\prime}_{jk}|^{l}\right)^{\frac{1}{l}},$$

    where \(l\) is a positive integer taking values \(1, 2, \ldots, \infty\):

    \(l=1\): Manhattan distance (also known as Hamming distance, city block, taxicab or \(L_{1}\) norm).

    \(l=2\): Euclidean distance (\(L_{2}\) norm), often used for measuring the dissimilarity of dense, continuous data.

    \(l=\infty\): Supremum distance (\(L_{max}\) or \(L_{\infty}\) norm), i.e. the maximum difference between any attribute of the two examples. A vectorized sketch of this transformation is given below.
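A direct vectorized rendering of this formula is sketched below; rows index the examples of the reduced data and columns index the prototypes, so the result is exactly the dissimilarity-based data \(\hat{D}\) (without the class column). The function name is illustrative.

```python
import numpy as np

def minkowski_dissimilarity(X, P, l=2):
    """Dissimilarity transformation: entry (i, j) holds dis(x_i, p_j) under
    the Minkowski (L_l) distance; l=1 Manhattan, l=2 Euclidean, l=np.inf supremum."""
    diff = np.abs(X[:, None, :] - P[None, :, :])  # shape (n, r, m)
    if np.isinf(l):
        return diff.max(axis=2)
    return (diff ** l).sum(axis=2) ** (1.0 / l)

# Example: 4 examples with 3 features and 2 prototypes -> a 4x2 dissimilarity matrix
X = np.array([[0., 0, 0], [1, 1, 1], [2, 0, 1], [0, 2, 2]])
P = X[[0, 1]]
print(minkowski_dissimilarity(X, P, l=1))
```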

A distance measure \(d\) is called a metric when it fulfills the following conditions:

  1. Reflexivity: \(d(x,x)=0\).

  2. Positivity: \(d(x,y)>0\) if \(x\) is distinct from \(y\).

  3. Symmetry: \(d(x,y)=d(y,x)\).

  4. Triangle inequality: \(d(x,y)\leq d(x,z)+d(z,y)\) for every \(z\).

As Pekalska et al. [60] suggested, reflexivity and positivity are crucial to defining a proper dissimilarity measure. It is unacceptable for a dissimilarity measure to be zero for two different objects, since this would violate the compactness hypothesis [1, 15], which states that objects that are similar are also close in their representations. Besides, negative dissimilarities between examples are difficult to interpret. Therefore, in the dissimilarity transformation, the metric Euclidean and Manhattan distances based on sums of differences between measurements are adopted to measure the dissimilarity between examples. After the dissimilarity transformation, the reduced imbalanced data set is projected into the dissimilarity space, in which the classification model is built.

3.5 Complexity of the proposed method

The proposed EDBC consists of feature selection, prototype selection, dissimilarity transformation and classification; its complexity is therefore the sum of the complexities of the methods adopted in each of these steps.

Suppose a historical imbalanced data set \(D\) with \(n\) instances and \(m\) features; the complexity of each step of the EDBC algorithm is then as follows:

  (i) Complexity of feature selection

    Since feature selection is employed to improve the quality of prototype selection and dissimilarity transformation rather than to affect the classification algorithm directly, filter-based feature selection methods are selected to extend the dissimilarity-based classification algorithm. Filter-based feature selection aims to remove redundant and irrelevant features. In this process, the computational overhead lies in evaluating the correlation between each feature and the target feature, and between pairs of features, so the complexity of feature selection is \(T_{FS}=O(m^{2})\).

  (ii) Complexity of prototype selection

    Different methods for selecting \(r\) prototypes have different complexities. Random selection just samples with replacement \(r\) times to obtain the \(r\) prototypes, so its complexity is \(T_{RC}=O(r)\). Prototype selection with the KCentres method is accomplished by iterative clustering over \(t\) iterations until the \(r\) centers become fixed, so its complexity is \(T_{KC}=O(rnt)\). When applying Jarvis-Patrick clustering to select prototypes, the computational overhead mainly lies in calculating the dissimilarity between examples, so its complexity is \(T_{JPC}=O(n^{2})\).

  (iii) Complexity of dissimilarity transformation

    In the dissimilarity transformation, the primary task is to project the reduced data set into the dissimilarity space by computing the dissimilarity between each example of the reduced data and the \(r\) prototypes, so the complexity of dissimilarity transformation is \(T_{DT}=O(nr)\).

Summing the complexities of feature selection, prototype selection, dissimilarity transformation and classification, the total complexity of the proposed EDBC algorithm is \(T_{EDBC}=T_{FS}+T_{PS}+T_{DT}+T_{C}\), i.e. \(T_{EDBC}=O(m^{2})+\{O(r) \text{ or } O(rnt) \text{ or } O(n^{2})\}+O(nr)+O(C(n,r))\), where \(T_{PS}\) and \(T_{C}\) respectively denote the complexity of prototype selection and of classification on the reduced data, and \(T_{C}=O(C(n,r))\).

From the basic principle of each of the other imbalance solutions, we can deduce that the complexity of RUS is \(T_{RUS}=O(C_{min})+O(C(n^{\prime},m))\) with \(n^{\prime}=2C_{min}\); the complexity of ROS is \(T_{ROS}=O(C_{max})+O(C(n^{\prime},m))\) with \(n^{\prime}=2C_{max}\); the complexity of SMOTE, in which the minority class is over-sampled by creating "synthetic" examples rather than by over-sampling with replacement, is \(T_{SMOTE}=O(SMOTE(k,n^{\prime},t))+O(C(n^{\prime},m))\) with \(n^{\prime}=2C_{max}\), where \(k\) is the number of nearest neighbors in the minority class and \(t\) denotes the number of iterations; and the complexity of Bagging, Boosting, MetaCost and EM1vs1 all depends on the number of ensemble learning iterations, which can be represented as \(T_{Ensemble}=O(tC(n,m))\).

Since RUS, ROS and SMOTE just perform random sampling of the majority or minority class with replacement, or generation of new minority examples, without further computation, their complexity may be lower than that of the proposed EDBC algorithm. In contrast, the ensemble learning methods are accomplished via repeated iterative analysis, which is very time-consuming, so their computational complexity is far higher than that of the proposed EDBC algorithm.

4 Empirical study

In this section, we carry out extensive experiments on public imbalanced data sets in order to validate the effectiveness of our proposed EDBC algorithm. First, we introduce the statistical information of the empirical imbalanced data sets in Section 4.1. Second, we design the experiments so as to comprehensively analyze the performance of EDBC in Section 4.2. Third, we set up the methods for each step of EDBC in detail in Section 4.3. Fourth, we describe the detailed experimental process in Section 4.4. Finally, we analyze and discuss the experimental results corresponding to the investigations proposed in the experimental design in Section 4.5.

4.1 Data sets

For the purpose of evaluating the performance and effectiveness of the proposed method EDBC and allowing other researchers to confirm our experimental results, we collected 24 binary imbalanced data sets for classification. Half of the data sets are available from the UC Irvine Machine Learning Repository [20] and come from different areas (biology, medicine and software engineering); the rest are software defect prediction data sets in the form of metrics derived from software source code.

Table 1 shows the detailed statistical information of these data sets. I and F respectively denote the number of instances and the number of features. \(C_{min}\), \(C_{maj}\) and IR respectively represent the number of examples in the minority class, the number of examples in the majority class, and the imbalance ratio, which expresses how skewed the class distribution of each data set is and is calculated as the ratio of \(C_{maj}\) to \(C_{min}\). To confirm the effectiveness of our proposed EDBC algorithm on imbalanced data sets, the imbalance ratios of the 24 empirical data sets we collected are all greater than 3.

Table 1 Summary of 24 imbalanced binary data sets

4.2 Experimental design

In the experiment, two investigations are conducted to evaluate the classification performance of our proposed EDBC algorithm.

  (1) Investigation 1: Can the EDBC algorithm improve imbalanced learning with conventional classification algorithms?

    This investigation aims to explore whether the EDBC algorithm can improve imbalanced learning performance with conventional classification algorithms, and how its improvement compares with that of other imbalance handling methods. To this end, we compare the performance of each given classification algorithm under the proposed EDBC algorithm with its performance on the original empirical data sets and under the other popular imbalance solutions.

  (2) Investigation 2: Which factors affect the performance of EDBC, and how, as the imbalance ratio alters?

    This investigation aims to learn which factors impact the performance of EDBC and how its performance is affected by each factor as the imbalance ratio changes.

    In the proposed EDBC algorithm, four factors may affect the classification performance: the choice of feature selection method, the choice of prototype selection method, the number of prototypes and the dissimilarity measure. When observing the effect of one factor, we compare the differences in the classification performance of the EDBC algorithm under various settings of that factor while keeping the other factors fixed. In this way, we can assess the effect of every factor on EDBC when solving imbalanced learning problems.

4.3 Experimental setup

In this section, we set up the methods adopted in each step of EDBC individually: the feature selection methods, the prototype selection methods and the number of prototypes, the distance measures in the dissimilarity transformation, and the classification algorithms.

  (1) Feature selection methods

    In the feature selection step, three representative feature selection methods are applied to choose the important features from the raw training data: the classical correlation-based feature selection CFS, the correlation-based filter solution FCBFS, and the clustering-based feature subset selection algorithm FAST.

  (2) Prototype selection methods

    In the prototype selection step, three prototype selection methods are employed to select the representative examples: random sampling for each class (RC), KCentres (KC) and Jarvis-Patrick clustering (JPC).

    The number of prototypes determines the dimensionality of the new imbalanced data mapped into the dissimilarity space. If it is too small, the much lower dimensionality degrades the ability to distinguish examples; if it is too large, it increases the computational complexity and introduces similar, redundant prototypes.

    As suggested by Pekalska et al. [57], the number of prototypes should be in the range of 3-10 % of the training data. Meanwhile, considering the trade-off between classification performance and computational complexity, we set the number of selected prototypes to \(r=\log I\) (where \(I\) is the number of instances) [36] and to \(r=20\) for each training data set.

  (3) Distance measures

    In the dissimilarity transformation step, the Euclidean distance and the Manhattan distance are utilized to compute the pairwise dissimilarity between the examples of the reduced imbalanced data set and the set of prototypes.

  (4) Classification algorithms

    After the dissimilarity transformation, the classification models are built on the new imbalanced data sets in the dissimilarity space. Five traditional classification algorithms are applied to construct the classification model on each imbalanced data set: the probability-based Naive Bayes (NB), the instance-based nearest neighbor (IB1) [30, 74], the tree-based Random Forest (RF) [36], the network-based Multilayer Perceptron (MLP) [25, 28, 30] and the linear Logistic Regression (LR) [70], all as implemented in WEKA.

Procedure Experimental Process (rendered as a figure)

4.4 Experimental process

In the experimental process, we apply the EDBC algorithm to each binary imbalanced data set with all combinations generated by the three prototype selection methods, the three feature selection methods, the two classical metric distance measures and the five traditional classification algorithms. The details of our experimental process are described in Procedure Experimental Process.

To further validate the effectiveness of the proposed EDBC algorithm, we additionally apply seven popular imbalance handling methods to the empirical data sets and record their performance on each data set for the subsequent comparison with EDBC. The seven imbalance solutions are random under-sampling (RUS), random over-sampling (ROS), the synthetic minority over-sampling technique (SMOTE), Bagging, Boosting, cost-sensitive learning (MetaCost) and EM1vs1.

For evaluating the performance of the proposed EDBC algorithm and the other imbalance solutions, a 5×10-fold cross-validation strategy is adopted. In each 10-fold cross-validation, a raw imbalanced data set is randomly divided into 10 equal folds, and for each fold EDBC is trained on the remaining nine folds and tested on that fold. To obtain reliable and stable results, the 10-fold cross-validation is repeated 5 times with the examples randomly reordered in each repetition. The average AUC over the 50 runs is used as the measure for evaluating the classification performance of EDBC on the imbalanced data sets.
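A sketch of this protocol with scikit-learn is shown below. Stratified folds are assumed so that every fold contains minority examples (the paper says only "randomly divided"), and `clf` would need to encapsulate the whole EDBC pipeline so that feature and prototype selection are re-run inside each training fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def repeated_cv_auc(clf, X, y, repeats=5, folds=10):
    """5x10 cross-validation: 10-fold CV repeated 5 times with reshuffled
    examples; returns the mean AUC over all 50 test folds."""
    aucs = []
    for seed in range(repeats):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
        aucs.extend(cross_val_score(clf, X, y, cv=cv, scoring="roc_auc"))
    return float(np.mean(aucs))
```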

4.5 Experimental results and analysis

In this section, we first present the performance comparison between our proposed method EDBC and the other existing imbalance solutions in Section 4.5.1, as a response to the question posed in Investigation 1, and then analyze the impact of different settings at each step on the performance of the proposed EDBC algorithm in Section 4.5.2, with the intent of answering the question raised by Investigation 2.

4.5.1 Performance comparison

For each given classification algorithm, we extensively compared its classification performance under EDBC with that under the seven other imbalance data handling methods (RUS, ROS, SMOTE, Bagging, Boosting, MetaCost and EM1vs1) in terms of AUC. Tables 2-6 show the detailed comparison results for Naive Bayes, Random Forest, IB1, Multilayer Perceptron and Logistic Regression, respectively.

Table 2 shows the classification performance of Naive Bayes on each imbalanced data set under the different imbalance data handling methods. We observe that 4 out of the 8 methods improve the performance of Naive Bayes, with EDBC performing best. Compared with SMOTE and EM1vs1, EDBC improves the performance of Naive Bayes by 7.69 %; compared with RUS and ROS, by 6.33 %; compared with Bagging, by 5 %; and compared with Boosting and MetaCost, by 9.09 %.

Table 2 The AUC values of different imbalance data handling methods with Naive Bayes

Table 3 shows the classification performance of Random Forest on each empirical imbalanced data set under the different imbalance handling methods. From this table, we find that 4 out of the 8 methods improve the performance of Random Forest, among which EDBC and Bagging are the best methods with the same average AUC. Comparing the proposed EDBC algorithm with the other six imbalance solutions, the classification performance of Random Forest with EDBC is improved by 3.70 % relative to RUS, ROS and MetaCost, by 6.33 % relative to Boosting, by 2.44 % relative to SMOTE and by 1.2 % relative to EM1vs1.

Table 3 The AUC values of different imbalance data handling methods with Random Forest

Table 4 shows the classification performance of IB1 on each imbalanced data set under the 8 imbalance handling methods. We observe that 7 out of the 8 methods greatly increase the performance of IB1 on the imbalanced data sets, with the proposed EDBC algorithm ranking first with the highest average classification performance. Specifically, the classification performance of IB1 with EDBC is higher by 18.57 % than with ROS, by 15.28 % than with RUS, by 16.90 % than with SMOTE and MetaCost, by 11.69 % than with Bagging, by 12.16 % than with Boosting and by 9.21 % than with EM1vs1.

Table 4 The AUC values of different imbalance data handling methods with IB1

Table 5 shows the classification performance of Multilayer Perceptron on each imbalanced data set under the different imbalance solutions. We observe that all the imbalance data handling methods improve the classification performance of Multilayer Perceptron on the imbalanced data sets, among which our proposed EDBC algorithm ranks first with the highest AUC. Specifically, the classification performance of Multilayer Perceptron with EDBC is increased by 7.5 % compared with RUS, Boosting and EM1vs1, by 8.86 % compared with ROS and MetaCost, by 10.26 % compared with SMOTE and by 4.88 % compared with Bagging.

Table 5 The AUC values of different imbalance data handling methods with Multilayer Perceptron

Table 6 shows the classification performance of Logistic Regression under the 8 imbalance learning handling methods. We observe that 6 out of the 8 methods increase the performance of Logistic Regression, with EDBC again ranking first with the best classification performance. By comparison, the classification performance of Logistic Regression with EDBC is improved by 13.16 % relative to RUS and ROS, by 11.69 % relative to SMOTE, Boosting and MetaCost, by 8.86 % relative to Bagging and by 10.26 % relative to EM1vs1.

Table 6 The AUC values of different imbalance data handling methods with Logistic Regression

To sum up, our proposed EDBC algorithm is capable of building a more effective classification model for imbalanced learning problems and, compared with the seven other imbalance data handling methods, it is consistently the most outstanding.

To validate whether our proposed EDBC algorithm significantly outperforms the other solutions across all classification algorithms, the Wilcoxon signed-rank test was conducted at the significance level 0.05. The alternative hypothesis for each classification algorithm is that the EDBC method is superior to the compared methods.

The p-values of the hypotheses are all well below 0.05 except for Random Forest. This means that the EDBC algorithm is significantly superior to the other methods under the four classification algorithms Naive Bayes, IB1, Multilayer Perceptron and Logistic Regression. When Random Forest is used as the classification algorithm, EDBC and Bagging are both the best imbalance solutions and there is no significant difference between them.
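The test can be reproduced per classification algorithm as sketched below, pairing the 24 per-data-set AUCs of EDBC with those of one compared method; the arrays here are short illustrative values, not the paper's results.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-data-set AUCs for one classifier (illustrative values only)
aucs_edbc  = np.array([0.91, 0.88, 0.93, 0.85, 0.90, 0.87])
aucs_other = np.array([0.86, 0.84, 0.90, 0.83, 0.88, 0.85])

# One-sided alternative: EDBC is superior to the compared method
stat, p = wilcoxon(aucs_edbc, aucs_other, alternative="greater")
print(f"W={stat:.1f}, p={p:.4f}", "(significant at 0.05)" if p < 0.05 else "")
```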

To provide an overall performance evaluation of all the given imbalance handling methods, we rank the solutions for each classification algorithm according to the average AUC over the 24 imbalanced data sets: the best performing algorithm receives rank 1, the second best rank 2, and so on, as shown in Table 7. In case of ties, the same rank is assigned, e.g. RUS and ROS both ranking 3rd, and Bench, SMOTE and EM1vs1 all ranking 5th. From this table we observe that our method EDBC ranks first with the minimum average rank. This means that the proposed EDBC method markedly improves the performance of the standard classification algorithms on imbalanced learning problems and surpasses all the other existing solutions. A small sketch of this ranking scheme follows Table 7.

Table 7 Rank of all imbalance handling methods with different classification algorithms
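The average-rank computation behind Table 7 can be sketched as follows; ties receive the same (minimum) rank, matching the convention described above, and the AUC values are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

# Rows: classification algorithms; columns: imbalance handling methods
# (illustrative average AUCs, not the paper's numbers)
auc = np.array([[0.82, 0.78, 0.78, 0.75],
                [0.80, 0.77, 0.77, 0.74]])
ranks = rankdata(-auc, method="min", axis=1)  # rank 1 = best; ties share a rank
print("per-row ranks:\n", ranks)
print("average rank per method:", ranks.mean(axis=0))
```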

4.5.2 Sensitivity analysis

As is well known, the kernel of the EDBC algorithm consists of feature selection, prototype selection and dissimilarity transformation, so the methods employed for feature selection and prototype selection, the number of prototypes and the dissimilarity measure adopted in the dissimilarity transformation are the crucial factors influencing the classification performance of EDBC.

In this section, we analyze and discuss the effect of each factor on the proposed EDBC algorithm. The details for each factor are illustrated in Figs. 3, 4, 5 and 6, where the X-axis denotes the imbalance ratio of each empirical data set and the Y-axis represents the classification performance (AUC) of EDBC with one specified factor.

  1. Sensitivity Analysis of Feature Selection Methods.

    Fig. 3 shows the effect of the four feature selection strategies on the classification performance of the proposed EDBC algorithm for each given classification algorithm as the imbalance ratio alters. From the figure we observe that:

    (i) For each given classification algorithm, the classification performance of EDBC is significantly affected by the different feature selection methods. Generally, FAST performs best of all the feature selection methods, while there is no significant improvement, and even a slight degradation, in the classification performance of EDBC when using CFS or FCBFS for feature selection. This means that the fast clustering-based feature subset selection method (FAST) is able to find, in each imbalanced data set, a feature subset that is most relevant to the class concept, which is conducive to improving the subsequent prototype selection and dissimilarity transformation.

    (ii) From the fluctuation of the classification performance of EDBC with the different feature selection methods, we find that the effect of feature selection on EDBC is very significant when \(IR<16\) and becomes stable and weaker as IR increases. Although FAST significantly outperforms the other feature selection methods within EDBC for all given classification algorithms when \(IR<16\), the differences in the performance of EDBC across the feature selection methods become less obvious as IR increases. This means it is difficult to select the salient features from extremely imbalanced data sets with the given feature selection methods.

    (iii) Compared to the situation without feature selection (nonFS), the classification performance of EDBC using FAST for feature selection is increased by 11 % for Naive Bayes, 12 % for Random Forest, 23 % for IB1, 6 % for Multilayer Perceptron and 7 % for Logistic Regression, respectively.

    Fig. 3 The sensitivity analysis of four feature selection methods

    From the above analysis of the effect of the different feature selection methods, we conclude that feature selection plays a critical role in the proposed method EDBC, especially the FAST method. Additionally, a more effective feature selection method is needed for extremely imbalanced data sets.

  2. Sensitivity Analysis of Prototype Selection Methods.

    Fig. 4 displays the effect of the three prototype selection methods RC, KC and JPC on the classification performance of EDBC as the imbalance ratio increases. From the figure we observe that:

    (i) The classification performance of EDBC is significantly affected by the different prototype selection methods; RC and JPC within EDBC achieve similar classification performance as the imbalance increases, and both are superior to KC on all imbalanced data sets.

    (ii) When \(IR<16\), the performance of EDBC with RC and JPC fluctuates widely and tends towards stability as IR increases. In contrast, the performance of EDBC with KC is relatively stable but worse overall.

    (iii) Compared to KC, the classification performance of EDBC with RC and JPC is increased on average by about 12 % for Naive Bayes, 11 % for Random Forest, 15 % for IB1 and Logistic Regression, and 14 % for Multilayer Perceptron, respectively.

    Fig. 4 The sensitivity analysis of three prototype selection methods

    In summary, KC is not suitable for clustering imbalanced data sets, because it cannot handle non-globular clusters, clusters of different sizes and densities, or data sets containing outliers, which is the nature of imbalanced data. In contrast, RC performs stratified sampling according to the class distribution and JPC compensates for the disadvantages of KC, so RC and JPC are very appropriate for selecting the representative examples from imbalanced data sets. Furthermore, the advantages of RC and JPC become more and more obvious as IR increases.

  3. Sensitivity Analysis of the Number of Prototypes.

    Fig. 5 shows the impact of the number of prototypes, \(r=\log I\) versus \(r=20\), on the classification performance of EDBC for the different classification algorithms. From it we find that: i) the classification performance of EDBC with the different numbers of prototypes shows a similar trend as IR increases; ii) when \(r=20\), the classification performance of EDBC is on average higher than that with \(r=\log I\) by 5 % for all classification algorithms except Random Forest; and iii) when \(IR>16\), the difference between the two settings is no longer significant.

    Fig. 5 The sensitivity analysis of two numbers of prototypes

    This means that the classification performance of EDBC is degraded with only a few prototypes, because it is difficult to distinguish the examples projected into the dissimilarity space with reference to a few poor prototypes. It also implies that EDBC becomes less sensitive to the number of prototypes as IR increases. In practice, taking the computational complexity and classification performance into account, as long as the number of prototypes lies in the rational range of 3-10 % of the data size, as suggested by Pekalska et al. [57], the EDBC algorithm will obtain a stable and satisfactory classification performance.

  4. Sensitivity Analysis of Dissimilarity Measures

    Fig. 6 illustrates the effect of the two distance measures (Euclidean distance and Manhattan distance) on the classification performance of the proposed EDBC algorithm. From the figure we observe that:

    (i) The classification performance of EDBC with the Euclidean distance as the dissimilarity measure is superior to that with the Manhattan distance. The Euclidean distance reflects the real distance between two points and thus benefits the construction of the classification model in the dissimilarity space. The Manhattan distance, in contrast, is the sum of the lengths of the projections of the line segment between two points, which may lead to the overlapping of multiple points in the dissimilarity space, e.g. there are many paths with the same Manhattan distance between two points. It is difficult for EDBC to distinguish such overlapping points in the dissimilarity space via pairwise Manhattan distances.

    (ii) The classification performance of EDBC with the Euclidean distance is more volatile than that with the Manhattan distance when \(IR<16\), and it tends to stabilize as IR increases. On average, the proposed method EDBC with the Euclidean distance outperforms that with the Manhattan distance by 6 % for Naive Bayes, 8 % for Random Forest, 9 % for IB1, 7 % for Multilayer Perceptron and 10 % for Logistic Regression, respectively. As IR increases, the differences between the performance of EDBC with the two dissimilarity measures become less obvious, implying that the effect of the dissimilarity measure on the classification performance of EDBC dies away under a highly imbalanced class distribution.

    Fig. 6 The sensitivity analysis of two dissimilarity measures

    In general, the Euclidean distance is more suitable than the Manhattan distance for the dissimilarity transformation in EDBC.

5 Conclusions

The imbalanced learning problem is a challenge in the data mining domain. A skewed class distribution often degrades the performance of most traditional classification algorithms, such as Naive Bayes, IB1, C4.5, Logistic Regression, Neural Networks and Support Vector Machines. A number of solutions have been proposed for improving imbalanced classification performance, but they pay more attention to rebalancing the skewed class distribution and finding more suitable classification algorithms, essentially ignoring how to improve the discriminant ability of features in order to increase the performance of traditional classification algorithms on imbalanced data sets.

From this perspective, we have proposed an expanded dissimilarity-based classification algorithm (EDBC) for classifying imbalanced data sets, which proceeds as follows:

  1. Remove the useless and redundant features from the original imbalanced data via feature selection, in order to mitigate their effect on the quality of the subsequent prototype selection and dissimilarity transformation;

  2. Select representative examples for each class from the reduced data to produce a prototype set;

  3. Project the reduced data into the dissimilarity space by computing the dissimilarity between the examples of the reduced data and the prototype set;

  4. Construct the classifier on the new data set in the dissimilarity space.

Unlike the existing imbalance data handling methods, the proposed EDBC resolves imbalanced learning problems fundamentally by enhancing the discriminant ability of features with the help of the dissimilarity-based representation. To confirm the effectiveness and efficiency of the proposed EDBC algorithm, we have carried out an extensive empirical study on 24 benchmark imbalanced data sets, evaluated the performance of the EDBC algorithm in terms of AUC, and compared it with the other commonly used imbalance solutions, including random under-sampling, random over-sampling, SMOTE, Bagging, Boosting, MetaCost and EM1vs1, under five traditional classification algorithms: Naive Bayes, Random Forest, IB1, Multilayer Perceptron and Logistic Regression.

First, we compared the average performance of each given classification algorithm under the proposed EDBC algorithm with that under each other imbalance handling method. EDBC increased the classification performance on the imbalanced problems by 5-9.09 % for Naive Bayes, 1.2-6.33 % for Random Forest, 9.21-18.57 % for IB1, 4.88-10.26 % for Multilayer Perceptron and 8.86-13.16 % for Logistic Regression, respectively. The comparison results show that our proposed EDBC can greatly increase the performance of all the given classification algorithms and, overall, outperforms the other imbalance solutions.

Second, we analyzed and discussed the sensitivity of our proposed EDBC algorithm to the choice of method at each step: the feature selection method, the prototype selection method, the number of prototypes and the dissimilarity measure. Through a comparative analysis of the classification performance of EDBC on the imbalanced data sets, we conclude that FAST is the best choice for feature selection, random selection and the Jarvis-Patrick clustering algorithm are more effective for prototype selection with a rational number of prototypes (3-10 % of the training examples), and the Euclidean distance is more suitable for measuring the dissimilarity between examples in the dissimilarity transformation.

Moreover, we have analyzed the computational complexity of the proposed EDBC algorithm and compared it with the other imbalance handling methods. The comparison shows that the complexity of the proposed EDBC algorithm may be slightly higher than that of RUS and ROS, but it is significantly lower than that of each ensemble learning method. This means that the proposed expanded dissimilarity-based classification method can improve imbalanced learning performance both effectively and efficiently.