Abstract
Although a large number of solutions have been proposed to handle imbalanced classification problems over the past decades, many studies have pointed out that the imbalance problem does not degrade learning performance on its own but only in combination with other factors. One of these factors is the overlapping problem, which plays an even larger role in classification performance deterioration yet has often been ignored in previous studies. In this paper, we propose a density-based adaptive k nearest neighbor method, named DBANN, which can handle imbalanced and overlapping problems simultaneously. To do so, a simple but effective distance adjustment strategy is developed to adaptively find the most reliable query neighbors. Concretely, we first partition the training data into six parts using a density-based method. Next, for each part, we modify the distance metric by considering both the local and the global distribution. Finally, the prediction is made from the query neighbors selected under the new distance metric. Notably, the query neighbors of DBANN change adaptively according to the degree of imbalance and overlap. To show the validity of our proposed method, experiments are carried out on 16 synthetic datasets and 41 real-world datasets. The results, supported by proper statistical tests, show that our proposed method significantly outperforms the state-of-the-art methods.
1 Introduction
Imbalanced classification, which arises in scenarios as diverse as industrial manufacturing [1, 2], financial management [3, 4], biomedical engineering [5] and information technology [6], is one of the critical issues in machine learning and data mining. Imbalanced datasets exhibit a skewed distribution: the instances of one class outnumber those of the other classes. Most standard classifiers tend to be biased toward the majority class, leading to a high misclassification rate for minority instances. The imbalance problem is commonly viewed as a major challenge to classification and has attracted great attention [7].
Existing solutions for the imbalance problem can be roughly grouped into four categories: resampling techniques, algorithm modification methods, cost-sensitive learning approaches and ensemble learning methods.
-
Resampling techniques aim to rebalance the training dataset by some mechanism that generates a more or less balanced class distribution suitable for standard classifiers [8, 9].
-
Algorithm modification methods try to adjust the structure of standard classifiers to diminish the effect caused by class imbalance [10].
-
Cost-sensitive learning approaches generally consider higher cost for minority class to compensate for the scarcity of minority data [11].
-
Ensemble learning methods were originally developed to enhance classification ability by combining different single classifiers. Researchers have modified ensemble algorithms to adapt them to imbalanced problems, with promising results [12,13,14].
In addition to the imbalance problem, overlapping between classes is recognized as another factor that degrades learning performance [15, 16]. Overlap appears when a region contains almost equal numbers of instances from different classes. This situation results in roughly the same prior probability for each class and thus poses a strong handicap for classification. The overlapping problem is pervasive in many real-world applications such as fault diagnosis [17], character recognition [18], speech classification [19] and drug design [20]. In these scenarios, instances from different classes usually have similar characteristics in the feature space. For example, in character recognition, the letters ‘O’ and ‘o’ and the numeral ‘0’ have almost identical shapes, which produces an overlapping region in the feature space that is hard to separate. Previous investigations have shown that overlap degrades classification performance even more severely than imbalance [21]. To clearly present the relationship between the two factors, a series of experiments was conducted by varying the degree of imbalance and overlap in the dataset. The conclusions state that learning algorithms can yield competitive performance when a dataset has a low overlapping degree combined with a high imbalance ratio, but they struggle to achieve desirable results under a high overlapping degree even when the imbalance ratio is low. This demonstrates that overlap is the main factor in classification degeneration [21,22,23]. Furthermore, Denil and Trappenberg [15] took the size of the dataset into consideration, and their study revealed that in small datasets the learning process is hindered by imbalance and overlap separately, whereas when training data are sufficient the two factors interact to jeopardize learning performance.
Currently, most research deals with the overlapping and imbalance problems separately. In practical applications, however, the overlapping problem frequently occurs in imbalanced data, which poses an even greater challenge to classification. Although a few papers attempt to consider both factors as a whole [24, 25], the structures of the related algorithms are too complex to implement easily. To fill this gap, we propose a density-based adaptive nearest neighbor method (DBANN) which can deal with these two problems simultaneously with a simple structure. The main idea of DBANN is an adaptive distance adjustment strategy devoted to defining and exploiting reliable query neighbors. To the best of our knowledge, our approach is the first kNN-based method that aims to combat both imbalanced and overlapping problems. The main contributions of our study can be summarized as follows:
-
We propose a density-based kNN method named DBANN which can handle imbalanced and overlapping problems simultaneously.
-
To enhance classification ability, we develop a distance adjustment strategy using density-based methods to adaptively find out the most reliable query neighbors.
-
To validate the effectiveness of DBANN, we compare it with other state-of-the-art methods on 16 synthetic datasets and 41 real-world datasets.
The outline of this paper is as follows: Related works are described in Sect. 2. In Sect. 3, we introduce our proposed method, DBANN. Section 4 presents extensive experiments on both synthetic and real-world datasets. Results and discussion are given in Sect. 5. Finally, conclusions are presented in Sect. 6.
2 Related works
2.1 Overlapping problem in imbalanced datasets
In binary classification, imbalanced data refer to a distribution in which instances of one class outnumber those of the other, as shown in Fig. 1a. In this situation, the minority class is hard to recognize for standard classifiers, which prefer good coverage of the majority class in order to achieve desirable global performance. However, in real applications, the minority class often carries the critical information we need, such as defective parts among all products or patients among the general population. Therefore, it is imperative to gain deep insight into the intrinsic characteristics of imbalanced data. With this in mind, we note that imbalanced data do not hinder learning ability on their own but in combination with other factors, such as the size of the dataset [21, 26], noise [27, 28], small disjuncts [29, 30] and data shift [31, 32].
Overlap occurs in a region where both classes co-exist and is viewed as one of the main obstacles to classification [15], as depicted in Fig. 1b. In such a region, the probability of each class is approximately equal, which gives rise to a high misclassification rate. In order to quantify the overlapping degree for an individual feature dimension, Ho and Basu [33] proposed a metric called the maximum Fisher’s discriminant ratio (F1), as shown in Eq. (1), where \(\mu_{1} ,\mu_{2} ,\sigma_{1} ,\sigma_{2}\) indicate the means and variances of the two classes, respectively. For a multidimensional dataset, the maximal \(f\) among all the features is defined as F1. Datasets with a low value of F1 have a high degree of overlap and vice versa. To overcome the overlapping problem, some researchers change the data distribution before modeling. Tang [16] first transformed the original data into a more separable distribution through overlapping pattern extraction and rough set theory, and then applied the proposed DR-SVM to the transformed data. Batista et al. [34] used data cleaning techniques to cope with highly overlapping data and achieved desirable results. Similarly, in order to better prepare the data for classification, other pre-processing methods such as data selection and feature selection were proposed in [35, 36]. However, pre-processing methods may risk introducing noise or losing information. Xiong et al. [37] found that modeling the overlapping and non-overlapping regions separately is a promising scheme for solving the class overlapping problem. Following this line of thinking, Vorraboot et al. [38] first partitioned the training data into a non-overlapping region, a borderline region and an overlapping region; different techniques were then employed for the different regions, and finally the outputs of all techniques were combined. Nevertheless, that study is only suitable for two Gaussian classes with independent and identical distributions.
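As a concrete illustration, the per-feature ratio can be computed directly from its standard definition, \(f = (\mu_{1} - \mu_{2})^{2} /(\sigma_{1}^{2} + \sigma_{2}^{2})\), taking the maximum over features. The sketch below uses our own function name, not one from the paper:

```python
import numpy as np

def fisher_ratio_f1(X, y):
    """Maximum Fisher's discriminant ratio (F1) for a binary dataset.

    For each feature, f = (mu1 - mu2)^2 / (sigma1^2 + sigma2^2);
    F1 is the maximum f over all features.  Low F1 => high overlap.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    c1, c2 = np.unique(y)[:2]
    X1, X2 = X[y == c1], X[y == c2]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    return float(np.max(num / den))
```

Well-separated classes yield a large F1, heavily overlapping classes a small one, matching the interpretation above.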
In real-world applications, the overlapping and imbalance problems frequently co-exist in a dataset, yet this co-occurrence has often been ignored in previous studies. Figure 1c marks two such overlapping regions with circles. Compared with Fig. 1b, the overlapping regions here exhibit comparatively high density from the perspective of the global distribution. That is partly attributed to the sparsity of the minority class, which accentuates the compactness of the two overlapping regions by contrast. Additionally, the imbalance ratio differs between the overlapping region and other regions. It is worth noting that for learning algorithms based on a divide-and-conquer strategy [39], the variation of class distribution across regions may threaten classification performance. Besides, overlapping and imbalance problems also affect algorithms that are sensitive to data density, such as k nearest neighbors.
To address the overlapping and imbalance problems, Alejo et al. [25] developed a hybrid method which combines a modified back propagation (MBP) with a Gabriel graph editing technique (GGE): MBP copes with the imbalance issue and GGE is responsible for the overlapping problem. Vuttipittayamongkol et al. [40] proposed an overlap-based under-sampling method (OBU). By eliminating majority instances from the overlapping region, OBU improves the visibility of minority instances. So far, however, such studies remain scarce.
2.2 kNN-based methods for dealing with overlapping or imbalanced datasets
kNN is a typical non-parametric approach that is widely applied in diverse domains owing to its simple but powerful decision rule [41, 42]. However, when encountering an imbalanced class distribution, kNN tends to lose its power to yield competitive results [43, 44]. To cope with this, kNN has been modified to incorporate immunity against the influence of imbalance. Concretely, Kriminger et al. [43] proposed the class conditional nearest neighbor distribution algorithm (CCNND). To mitigate the effect of the imbalanced distribution, for each class CCNND calculates the number of training instances satisfying a specified distance condition within the k nearest neighbors of a query instance; an empirical cumulative distribution function (CDF) is then built and the probability for each class is computed. Dubey and Pudi [45] provided a weighting scheme (hereafter called W-kNN) to address the imbalance issue. W-kNN assigns a weight to each class based on the misclassification rate obtained by traditional kNN. Patel [46] developed a hybrid weighted strategy (hereafter called H-kNN). Its main advantage is the use of a dynamic k value, i.e., a small k for the minority class and a large k for the majority class; H-kNN thereby improves the ability to fully mine the information in an imbalanced distribution. On this basis, the same author took fuzzy rule-based classification into consideration and proposed an improved fuzzy k-nearest neighbor (hereafter called F-kNN) [47]. Based on fuzzy membership, the query instance knows a priori to what degree its neighbors belong to each class. Zhang and Li [48] presented a minority-biased nearest neighbor algorithm called PNN. In order to handle the inappropriate probability estimation for the minority class, PNN fixes the number of minority query neighbors; for example, m-PNN means that there must be m minority instances among the query neighbors.
Therefore, the number of query neighbors changes dynamically to ensure enough instances for probability estimation for both classes. k rare-class nearest neighbor classification (kRNN) [49] boosts PNN by updating the dynamic query neighbor strategy; the new strategy reinforces the analysis of the distributions around query instances, so kRNN can handle not only between-class imbalance but also within-class imbalance. Mullick et al. [50] proposed an adaptive learning kNN method called Ada-kNN. It uses a class-based global imbalance handling scheme (GIHS) to compensate for the scarcity of minority data. To assign a global weight to each class, GIHS considers both the ideal class probability (balanced distribution) and the actual class probability (imbalanced distribution).
The overlapping problem is another obstacle for learning algorithms, and kNN is no exception, as stated in Sect. 2.1. Garcia et al. [44] investigated the behavior of kNN when overlap exists in imbalanced data. The results reveal that when the imbalance ratio in the overlapping region equals the global imbalance ratio, i.e., the majority class dominates the overlapping region, the true positive rate (TPR) drops as the overlapping degree increases. Conversely, when the minority class becomes the most represented class in the overlapping region, the TPR increases instead. Additionally, they pointed out that the imbalance ratio in the overlapping region matters more than the size of the overlapping region and the global imbalance ratio. Wang et al. [51] proposed an extremely simple but well-performing algorithm called A-kNN. It aims to form reliable query neighbors for the final decision. To do so, A-kNN modifies the distance metric to move reliable instances closer to the query instance. Hence, even if the query instance lies in the overlapping region, which is viewed as an ambiguous and untrusted area, the query neighbors selected after distance adjustment are reliable. Although A-kNN is an effective solution for the overlapping problem, it cannot handle the imbalance issue.
Even though some efforts have been made to enhance the classification performance of kNN, several drawbacks remain to be addressed. First, previous kNN-based methods treat the imbalance and overlapping problems separately, although the two factors almost always co-exist in real-world applications. Second, some modified kNN methods choose pivot instances as query neighbors for decision making, but the selection criterion they use considers either global or local information alone; when data density varies widely across regions, and especially when the imbalance ratio differs significantly among these regions, such a criterion no longer works effectively. Finally, previous works ignore the influence of noisy instances, which can jeopardize classification performance significantly. In the following sections, we address the above concerns with a novel method.
3 Combating overlapping and imbalanced problems using density-based adaptive k nearest neighbor method (DBANN)
In this section, we aim to conquer the overlapping and imbalance problems with a density-based strategy. For ease of discussion, we focus on the binary classification problem in this paper, even though the method can be generalized to the multi-class case. Consider a given training dataset \(D\) with \(N\) instances, \(D = \left\{ {\left( {x_{1} ,y_{1} } \right),\left( {x_{2} ,y_{2} } \right), \ldots ,\left( {x_{N} ,y_{N} } \right)} \right\}\), where \(x_{i} \in X\) is a training instance in \(m\) dimensions and \(x\) is a query instance. Traditional kNN first determines the \(k\) nearest neighbors of the query instance based on the Euclidean distance in Eq. (2). Then, the majority vote in Eq. (3) is used to classify \(x\). Besides, \({\text{IR}}\) in Eq. (4) denotes the imbalance ratio of a dataset, where \(N_{{\rm minority}}\) is the number of minority instances and \(N_{{\rm majority}}\) is the number of majority instances. In the subsequent sections, we use \({\text{IR}}_{{\rm global}}\) for the imbalance ratio of the whole dataset and \({\text{IR}}_{{\rm local}}\) for the imbalance ratio in a specified region.
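For reference, the three quantities above can be sketched as follows. We assume the conventional form \({\text{IR}} = N_{{\rm majority}} /N_{{\rm minority}}\) for Eq. (4); the helper names are ours:

```python
import numpy as np
from collections import Counter

def euclidean(x, xi):
    # Eq. (2): Euclidean distance between query x and training instance xi
    return np.sqrt(np.sum((np.asarray(x, float) - np.asarray(xi, float)) ** 2))

def knn_predict(X_train, y_train, x, k=3):
    # Eq. (3): majority vote among the k nearest neighbors of x
    d = [euclidean(x, xi) for xi in X_train]
    idx = np.argsort(d)[:k]
    return Counter(np.asarray(y_train)[idx]).most_common(1)[0][0]

def imbalance_ratio(y):
    # Eq. (4), assumed form: IR = N_majority / N_minority
    counts = Counter(y).most_common()
    return counts[0][1] / counts[-1][1]
```

DBANN keeps this overall predict-by-vote structure but replaces the plain Euclidean metric with an adjusted one, as described next.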
3.1 Description of reliable query neighbors
Previous studies have shown the significance of query neighbors in kNN classification [48, 52]. Inspired by this, the main idea of DBANN is to seek reliable query neighbors to be used for the majority vote. Different from traditional kNN, DBANN selects query neighbors based not only on distance but also on data characteristics, i.e., the imbalanced and overlapping distribution in the local and global regions. Besides, we expect the query neighbors to change adaptively with different degrees of imbalance and overlap. Therefore, we first describe the characteristics of reliable query neighbors as follows:
-
There is no doubt that the query neighbors should be instances located close to the query instance so as to carry more representative information.
-
Due to the scarcity of minority data, the query neighbors should be biased toward the minority class.
-
Since instances in the overlapping region are hard to separate, it is suggested to view these instances as unreliable and to lower their probability of being selected as query neighbors.
-
In most cases, noisy instances are an obstacle for classification [28, 53, 54]. Hence, they should not be selected as query neighbors.
-
Since kNN has proved sensitive to data complexity and class density [44], it is desirable to take density and class distribution into consideration.
3.2 A density-based adaptive k nearest neighbor method (DBANN)
Most previous research on kNN methods concentrates on addressing the imbalance problem but overlooks the influence of overlap. Wang et al. [51] propose an adaptive kNN method (A-kNN) to deal with the overlapping problem. Different from traditional kNN, A-kNN modifies the distance metric according to the overlapping degree. First of all, for each training instance \(x_{i}\), A-kNN creates a reliability coefficient \(r_{i}\): the distance from \(x_{i}\) to the training instance \(x_{j}\) that is its nearest neighbor belonging to a different class, as listed in Eq. (5). A lower \(r_{i}\) value means \(\left( {x_{i} ,x_{j} } \right)\) lie close to each other, implying a high overlapping degree in that region, so \(x_{i}\) is viewed as an unreliable instance, and vice versa. In other words, \(r_{i}\) measures the reliability of \(x_{i}\): a high \(r_{i}\) value implies that \(x_{i}\) is reliable and helpful for classification, whereas a low \(r_{i}\) value indicates that \(x_{i}\) is unreliable. After obtaining \(r_{i}\) for each training instance, A-kNN adjusts and forms a new distance metric by Eq. (6). Finally, the output is obtained by Eq. (3). Technically speaking, A-kNN is a local method for handling the overlapping problem; however, it does not consider the imbalance issue.
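Equation (5) can be sketched as below. The exact form of the adjustment in Eq. (6) is not reproduced here; dividing the Euclidean distance by \(r_{i}\) is one plausible form with the stated effect (reliable, high-\(r_{i}\) instances are pulled closer), and should be read as our assumption rather than the authors' formula:

```python
import numpy as np

def reliability(X, y):
    """Eq. (5): r_i = distance from x_i to its nearest opposite-class neighbor.
    A small r_i marks x_i as lying in an overlapping (unreliable) region."""
    X, y = np.asarray(X, float), np.asarray(y)
    r = np.empty(len(X))
    for i in range(len(X)):
        other = X[y != y[i]]
        r[i] = np.min(np.linalg.norm(other - X[i], axis=1))
    return r

def adjusted_distance(x, xi, ri):
    # Assumed stand-in for Eq. (6): dividing by r_i pulls high-r_i
    # (reliable) instances closer and pushes low-r_i ones away.
    return np.linalg.norm(np.asarray(x, float) - np.asarray(xi, float)) / ri
```

Under this adjustment, two training instances at the same Euclidean distance from the query rank differently if their reliability coefficients differ.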
In this study, we extend the concept of \(r_{i}\) to handle both imbalanced and overlapping problems.
Concretely, in the first step, we cluster the training data into several clusters plus noisy instances using a density-based method (introduced in Sect. 3.3). To further characterize the clusters, we consider the overlapping issue and divide clusters into two types, defined as follows.
Definition 1
Overlapping cluster indicates that the cluster contains both majority and minority instances.
Definition 2
Non-overlapping cluster indicates that the cluster contains only majority or minority instances.
Therefore, after clustering, training data can be divided into six parts: (a) minority noisy instances, (b) majority noisy instances, (c) majority instances in overlapping cluster, (d) minority instances in overlapping cluster, (e) majority instances in non-overlapping cluster and (f) minority instances in non-overlapping cluster. Figure 2 depicts the six parts in detail.
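Given cluster labels from the density-based step (Sect. 3.3), the six-part partition can be sketched as follows. The tagging scheme and function name are ours, with −1 denoting noise as in common DBSCAN implementations:

```python
import numpy as np

def partition_six(y, labels, minority, majority):
    """Tag each instance with one of the six parts (a)-(f) described above."""
    y, labels = np.asarray(y), np.asarray(labels)
    parts = np.empty(len(y), dtype='<U1')
    for cid in set(labels):
        idx = labels == cid
        if cid == -1:                            # noisy instances
            parts[idx & (y == minority)] = 'a'
            parts[idx & (y == majority)] = 'b'
            continue
        overlapping = len(set(y[idx])) > 1       # cluster holds both classes?
        if overlapping:
            parts[idx & (y == majority)] = 'c'
            parts[idx & (y == minority)] = 'd'
        else:
            parts[idx & (y == majority)] = 'e'
            parts[idx & (y == minority)] = 'f'
    return parts
```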
Afterward, in the second step, we assign a reliability coefficient \(r_{i}\) to each training instance \(x_{i}\). Different from A-kNN, we capture the distribution variation and take the noise factor into consideration. Specifically, we assign \(r_{i}\) to the training instances in each part as follows:
-
(a)
For minority noisy instances, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the nearest majority neighbor.
-
(b)
For majority noisy instances, \(r_{i}\) is assigned a small random positive value.
-
(c)
For minority instances in an overlapping cluster, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the \({\text{IR}}_{{\rm local}}\)th nearest majority neighbor, where \({\text{IR}}_{{\rm local}}\) equals the imbalance ratio in the corresponding cluster and thus represents the local distribution. Obviously, a high imbalance ratio expands the detection radius of \(x_{i}\) and accordingly yields a larger \(r_{i}\). Noticeably, \(r_{i}\) also relates to the density of the cluster: a high-density cluster packs a large number of instances into a small region, so even when the detection radius expands to the \({\text{IR}}_{{\rm local}}\)th nearest majority neighbor, \(r_{i}\) may increase by only a small value compared with low-density clusters.
-
(d)
For majority instances in overlapping cluster, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the nearest minority neighbor.
-
(e)
For minority instances in a non-overlapping cluster, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the \({\text{IR}}_{{\rm global}}\)th nearest majority neighbor. Different from \({\text{IR}}_{{\rm local}}\) in the overlapping situation, \({\text{IR}}_{{\rm global}}\) here is the global imbalance ratio.
-
(f)
For majority instances in a non-overlapping cluster, \(r_{i}\) is also assigned as the distance from \(x_{i}\) to the nearest minority neighbor.
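Putting rules (a)-(f) together, the \(r_{i}\) assignment might be sketched as follows. The helper names, the clamp for clusters smaller than the detection index, and the magnitude of the "small random positive value" are our assumptions:

```python
import numpy as np

def kth_opposite_distance(X, y, i, target_class, k=1):
    """Distance from X[i] to its k-th nearest neighbor of target_class."""
    d = np.sort(np.linalg.norm(X[y == target_class] - X[i], axis=1))
    k = min(k, len(d))  # clamp when few opposite-class instances exist
    return d[k - 1]

def assign_reliability(X, y, labels, minority, majority, ir_global, eps=1e-3):
    """Sketch of rules (a)-(f); labels are cluster ids, -1 = noise."""
    X, y, labels = np.asarray(X, float), np.asarray(y), np.asarray(labels)
    r = np.empty(len(X))
    rng = np.random.default_rng(0)
    for i in range(len(X)):
        if labels[i] == -1:                      # noisy instance
            if y[i] == minority:                 # rule (a)
                r[i] = kth_opposite_distance(X, y, i, majority, 1)
            else:                                # rule (b): small random value
                r[i] = rng.uniform(0, eps)
            continue
        members = y[labels == labels[i]]
        overlapping = len(set(members)) > 1
        if y[i] == majority:                     # rules (d) and (f)
            r[i] = kth_opposite_distance(X, y, i, minority, 1)
        elif overlapping:                        # rule (c): local IR radius
            ir_local = max(1, int(round(
                np.sum(members == majority) / max(1, np.sum(members == minority)))))
            r[i] = kth_opposite_distance(X, y, i, majority, ir_local)
        else:                                    # rule (e): global IR radius
            r[i] = kth_opposite_distance(X, y, i, majority, max(1, int(ir_global)))
    return r
```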
In the next step, we adjust the distance metric by Eq. (6). The new distance metric depends on two quantities: \(r_{i}\) and the Euclidean distance \(d\left( {x,x_{i} } \right)\). A training instance \(x_{i}\) is reliable when it lies near the query instance \(x\) and possesses a high \(r_{i}\) value. After distance adjustment, reliable instances are thus pulled closer to the query instance and unreliable ones are pushed away.
Finally, the majority vote of Eq. (3) is applied under the new distance metric. From the procedure introduced above, we can see that the \(r_{i}\) assignment considers not only the local but also the global distribution of the dataset.
Figure 2 illustrates the process of DBANN. Stars and circles represent the majority and minority class, respectively, and hollow and solid shapes represent the original and adjusted (after distance adjustment) positions. Here, we focus only on the instances marked with letters. First, consider points a and b, which lie in an overlapping cluster. The two instances lie close to each other, indicating a high overlapping degree; therefore, they are viewed as unreliable points and ought to be pushed away from the query instance by Eqs. (5) and (6). Nevertheless, besides the overlapping degree, the imbalance issue is also considered. As a result, only the majority point b is pushed to B, whereas point a is pulled to A, closer to the query instance, owing to the local imbalance ratio (\({\text{IR}}_{{{\text{local}}2}} = 2\)). A similar distance adjustment applies to points c and d; the only difference is that point c is put even closer to the query instance owing to a higher local imbalance ratio (\({\text{IR}}_{{\rm local1}} = 7\)). Additionally, the majority noisy point e is pushed far away owing to its small random \(r_{i}\) value, and the minority noisy point f is pushed to F. As for the minority point g in a non-overlapping cluster, the global imbalance ratio (\({\text{IR}}_{{\rm global}} = 3\)) is used to pull it to G, and the majority point h is moved to H. Consequently, after distance adjustment, the query neighbors are points A, C and G, and the query instance is predicted as minority. Details of DBANN are listed in Algorithm 1.
3.3 Density-based clustering algorithm
In Sect. 3.2, we take advantage of a density-based clustering method to divide the training data into different parts. Technically speaking, this is a framework, which implies that many existing density-based clustering algorithms could be plugged into DBANN. In this paper, we choose DBSCAN.
DBSCAN is a typical density-based clustering algorithm which defines a cluster as a region of densely packed points separated by regions of lower density. It has attracted much attention for its desirable properties, including arbitrarily shaped clusters, automatic identification of the number of clusters and noise detection [55, 56]. Additionally, as a useful method to capture data distribution, DBSCAN is often combined with other algorithms to enhance classification ability, especially for severely complex data distributions [38, 57]. DBSCAN characterizes density variation through two input parameters, a positive value \(eps\) and a positive integer \(Minpts\). On this basis, some definitions of DBSCAN are listed as follows:
Definition 1
\(eps{\text{-}}neighborhood\) of a point \(p\) indicates the points within the radius \(eps\) around \(p\).
Definition 2
A point \(p\) is a core point if the number of points in its \(eps{\text{-}}neighborhood\) is more than \(Minpts\).
Definition 3
A point \(p\) is directly density-reachable from a point \(q\) if \(q\) is a core point and \(p\) is in its \(eps{\text{-}}neighborhood\).
Definition 4
A point \(p\) is density-reachable from a point \(q\) if there is a chain of points \(p_{1} ,{\kern 1pt} {\kern 1pt} \ldots {\kern 1pt} p_{n}\), \(p_{1} = q\), \(p_{n} = p\) which satisfies that \(p_{i + 1}\) is directly density-reachable from \(p_{i}\).
Definition 5
A border point \(p\) is in the \(eps{\text{-}}neighborhood\) of a core point \(q\) but has fewer than \(Minpts\) neighbors within its own \(eps\) radius.
Definition 6
A noisy point \(p\) is a point that is neither a core point nor a border point.
Initially, DBSCAN arbitrarily selects a point \(p\) and retrieves its entire \(eps{\text{-}}neighborhood\); this process is called QueryNeighbour. If the size of the \(eps{\text{-}}neighborhood\) is larger than \(Minpts\), point \(p\) is marked as a core point, forming a new cluster; otherwise, \(p\) is marked as a noisy point. Subsequently, the cluster expands by iteratively adding unvisited density-reachable points. The process is repeated until every point is marked either as belonging to a cluster or as a noisy point. Noticeably, even if a point is initially marked as noise, it may later become a border point of another cluster during the expansion process. Finally, DBSCAN yields several clusters plus the noisy points. Figure 3 demonstrates the process of DBSCAN (\(Minpts\) = 4; \(eps\) is indicated by the circles). As can be seen from the graph, point A is marked as a core point at the beginning and thus creates a new cluster. Afterward, the cluster expands based on the density measurement, involving all blue points until it reaches the yellow border points (F, G) at the edge of the cluster. Owing to low density, points H and I are marked as noisy points. The details of DBSCAN are shown in Algorithm 2.
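The procedure above can be condensed into a compact, illustrative implementation (a simplified sketch, not the exact Algorithm 2):

```python
import numpy as np

def dbscan(X, eps, minpts):
    """Minimal DBSCAN sketch: returns one label per point, -1 = noise."""
    X = np.asarray(X, float)
    n = len(X)
    labels = np.full(n, -1)            # -1: noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = -1

    def query_neighbour(i):            # all points within eps of X[i]
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = query_neighbour(i)
        if len(seeds) < minpts:        # not a core point (may become border later)
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(seeds)
        while seeds:                   # expand cluster via density-reachability
            j = seeds.pop()
            if labels[j] == -1:        # noise becomes a border point of this cluster
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                labels[j] = cluster
                nbrs = query_neighbour(j)
                if len(nbrs) >= minpts:
                    seeds.extend(nbrs)
    return labels
```

Applied to two dense blobs plus one isolated point, the sketch yields two cluster ids and a −1 label for the isolated point, mirroring Fig. 3.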
4 Experiments
In this section, experiments are carried out on 16 synthetic datasets and 41 real-world datasets to validate the effectiveness of the DBANN method. The datasets are described in Sect. 4.1. In Sect. 4.2, we list the comparative algorithms and the corresponding parameter settings. The evaluation metrics and statistical tests are introduced in Sect. 4.3. All experiments are implemented in Python (version 2.7.14).
4.1 Datasets used in the experiments
In this section, experiments are conducted on both synthetic and real-world datasets. Specifically, the synthetic data for both classes are generated from bivariate normal distributions. On the one hand, to explore classification performance under different overlapping degrees, the mean vector of the minority class is fixed at [0.00, 0.00], and the mean vector of the majority class is set to [0.05, 0.05], [0.50, 0.50], [1.30, 1.30] or [2.70, 2.70], representing four overlapping degrees. A mean vector of [0.05, 0.05] means the centers of the two classes are close, so a severe overlapping region exists; a mean vector of [2.70, 2.70] indicates that the two centers lie far apart, so overlap is minimal. On the other hand, the synthetic data are also generated at four different imbalance ratios by changing the numbers of instances in both classes. The description of the synthetic datasets is shown in Table 1. As for real-world applications, we select 41 datasets from the KEEL repository [58] following previous research [35], as shown in Table 2. KEEL is an open-source repository which provides benchmark datasets for assessing the behavior of algorithms in different scenarios [58]. The datasets are ordered by overlapping degree according to Fisher’s discriminant ratio (F1) and can thus be divided into two parts: low overlapping datasets with F1 > 1.6 and high overlapping datasets with F1 < 1.6. The imbalance ratio (IR) of the datasets ranges from 1.8 to 68.1, as shown in Table 2.
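A synthetic set of the kind described above can be generated as follows. The unit covariance is our assumption, since the paper fixes only the mean vectors:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_synthetic(n_min, n_maj, maj_mean, cov=np.eye(2)):
    """Bivariate-normal synthetic set: minority centered at the origin,
    majority at maj_mean; a smaller distance between means => more overlap."""
    X_min = rng.multivariate_normal([0.0, 0.0], cov, n_min)
    X_maj = rng.multivariate_normal(maj_mean, cov, n_maj)
    X = np.vstack([X_min, X_maj])
    y = np.array([1] * n_min + [0] * n_maj)   # 1 = minority, 0 = majority
    return X, y

# e.g., severe overlap (means nearly coincide) at imbalance ratio 4:
X, y = make_synthetic(50, 200, [0.05, 0.05])
```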
4.2 Algorithms and parameter settings
In our experiments, DBANN is compared with algorithms from three directions: (a) kNN-based methods: algorithms derived from k nearest neighbor and modified to address imbalanced or overlapping problems, including W-kNN [45], kRNN [49], F-kNN [47], H-kNN [46] and the standard kNN classifier. (b) Generality-oriented learning algorithms and strategies: CART decision tree and support vector machine (SVM), together with the data balancing methods SMOTE [59] and overlap-based under-sampling (OBU) [40], which are popular for handling imbalanced and overlapping problems. (c) Ensemble algorithms: Kd-tree-based efficient ensemble (KDE) [60], hybrid sampling with bagging (HSB) [61] and RUSBoost (RUS) [62], three effective bagging- and boosting-based ensemble methods for the imbalance problem.
-
For the kNN-based methods W-kNN, kNN, F-kNN, H-kNN, DBANN and kRNN, the parameter \(k\) is chosen from the original literature and set to 3, 3, 3, 3, 3 and 1, respectively. Other parameters are set according to the original literature. For DBANN, \(Minpts\) is set to 4, and \(eps\) is chosen as the optimal value from the range [0.01, 200] by 10-fold cross-validation.
-
For the generality-oriented algorithms, the support vector machine is implemented with a linear kernel, which shows desirable performance on the selected datasets. SMOTE and OBU are applied to equalize the numbers of minority and majority instances before classification.
-
For the ensemble algorithms, the base classifier is a decision tree, and the numbers of base classifiers are {10, 10, 40} for KDE, HSB and RUS, respectively, according to the original literature. In KDE, k = 3 and \(\varepsilon = 0.1\); in HSB, k = 3 and I = {0, 0.2, 0.4, 0.6, 0.8, 1}.
4.3 Performance measures and significance statistical test
All experiments are carried out using 10-fold cross-validation. The confusion matrix in Table 3 shows the four types of classification results. On this basis, two indicators, the geometric mean (GM) and the F-measure (F1), are used to evaluate classification performance; their detailed definitions are given in Eqs. (7)–(11). It can be seen from Eqs. (10) and (11) that GM considers the proportion of correctly classified instances in both the minority and majority classes, while F1 focuses on the harmonic mean of precision and recall.
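Based on the confusion-matrix entries of Table 3, the two indicators can be computed as follows (a direct transcription of the standard definitions; the function and variable names are ours):

```python
import math

def gm_and_f1(tp, fn, fp, tn):
    """GM: geometric mean of the recalls of both classes.
    F1: harmonic mean of precision and recall on the minority class."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # minority recall
    specificity = tn / (tn + fp) if tn + fp else 0.0   # majority recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    gm = math.sqrt(recall * specificity)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return gm, f1
```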
To evaluate if significant difference exists among experimental algorithms, it is necessary to use statistical tests. Here we adopt non-parametric statistical Friedman test and Bonferroni–Dunn post hoc test [63]. Friedman test is first employed to detect differences among all the algorithms in two indicators. After that, Bonferroni–Dunn is applied to check out if DBANN performs significantly better than comparative algorithms.
To implement the Friedman test, we first rank the performance of the \(K\) algorithms on each dataset: the best performance ranks 1 and the worst ranks \(K\). When ties appear, the average rank is assigned to each tied algorithm. Subsequently, we compute the Friedman statistic \(\chi_{F}^{2}\) by Eqs. (12) and (13). Specifically, \(r_{ij}\) denotes the rank of the \(j\)th of \(K\) algorithms on the \(i\)th of \(N\) datasets, so \(R_{j}\) represents the average rank of the \(j\)th algorithm. Moreover, Iman and Davenport [64] found that Friedman's \(\chi_{F}^{2}\) is undesirably conservative and derived a better statistic \(F_{F}\), which follows an F-distribution with \(K - 1\) and \(\left( {K - 1} \right)\left( {N - 1} \right)\) degrees of freedom, as shown in Eq. (14). The critical value \(q_{\beta }\) is calculated as \(q_{\beta } = F\left( {\alpha ,K - 1,\left( {K - 1} \right)\left( {N - 1} \right)} \right)\). When \(F_{F} > q_{\beta }\), the null hypothesis is rejected, i.e., a significant difference exists among the comparative algorithms, and vice versa. Furthermore, once the null hypothesis is rejected, the Bonferroni–Dunn post hoc test is conducted for pairwise comparisons between DBANN and the other algorithms. Here, the critical value \(q_{\gamma }\) is based on the studentized range statistic divided by \(\sqrt 2\) [63]. A significant difference exists when the average ranks of two algorithms differ by at least the critical difference (CD) [63].
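The test statistics above can be sketched as follows (a direct implementation of the standard Friedman, Iman–Davenport and Bonferroni–Dunn formulas; in practice the critical values \(q_{\beta}\) and \(q_{\gamma}\) are read from statistical tables):

```python
import math

def friedman_iman_davenport(ranks):
    """ranks[i][j] = rank of algorithm j on dataset i (1 = best).
    Returns Friedman's chi-square and the Iman-Davenport F statistic."""
    N, K = len(ranks), len(ranks[0])
    R = [sum(row[j] for row in ranks) / N for j in range(K)]  # avg ranks
    chi2 = 12 * N / (K * (K + 1)) * (sum(r * r for r in R)
                                     - K * (K + 1) ** 2 / 4)
    ff = (N - 1) * chi2 / (N * (K - 1) - chi2)
    return chi2, ff

def critical_difference(q_gamma, K, N):
    """Bonferroni-Dunn critical difference for average-rank comparisons."""
    return q_gamma * math.sqrt(K * (K + 1) / (6 * N))
```

For example, with 6 algorithms and 41 datasets, `critical_difference(2.576, 6, 41)` gives the CD against which average-rank gaps are compared.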
5 Results and discussion
5.1 Analyzing the critical parameters and property of DBANN
In this section, we provide an insight into detailed properties of DBANN. We first discuss the influence of parameter \(eps\) and parameter \(k\) on classification performance. Afterward, we investigate the distribution of query neighbors in DBANN. Finally, we analyze the advantage of DBANN over other kNN-based methods.
5.1.1 \(eps\) value
\(eps\) and \(Minpts\) are the two input parameters of DBSCAN. Previous studies [55, 65] reported that \(Minpts\) has little impact on clustering results. Therefore, in this section \(Minpts\) is set to 4 and \(eps\) is varied from 0.01 to 200 to analyze its influence on classification performance. Here we choose five real-world datasets for experiments, and the results are shown in Fig. 4. It is easy to see that \(eps\) is a sensitive parameter that dominates the performance of DBANN. In general, with the increase of \(eps\), F1 and GM first rise with fluctuation (we call this phase I), and then the performance stabilizes within a fixed range (phase II). Noticeably, for some datasets (yeast1, glass016vs5, abalone21vs8) the optimal \(eps\) value lies in phase I, while for others (newthyroid2, winequalityred3vs5) it lies in phase II. In this study, grid search is used to determine the optimal \(eps\) value.
To reveal the root cause of this sensitivity to \(eps\), a further analysis is provided of the relationship among \(eps\), the clustering situation and classification performance. We demonstrate this issue on the dataset glass016vs5, and the results are shown in Table 4. We find that \(eps\) directly affects the clustering situation. When \(eps < 0.1\), all instances are labeled as noisy instances and no clusters are formed. With the increase of \(eps\), more noisy instances are absorbed into clusters. When \(eps = 0.3\), DBANN forms at most five clusters. Subsequently, clusters expand and merge until finally forming one big cluster with no noisy instances. As a result, F1 and GM also vary with \(eps\), as shown in the last two columns of Table 4. In particular, when \(eps = 2\), DBANN achieves the optimal performance (F1 = 53.53%, GM = 72.96%) with two clusters and ten noisy instances.
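This dependence of the clustering situation on \(eps\) can be reproduced with a minimal pure-Python DBSCAN sketch (illustrative only; DBANN uses the standard DBSCAN algorithm [55], and the toy points below are ours, not the glass016vs5 data):

```python
def dbscan(points, eps, minpts):
    """Minimal DBSCAN: returns one label per point (-1 = noise).
    Illustrative only; use a library implementation in practice."""
    n = len(points)
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    neigh = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
             for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neigh[i]) < minpts:
            labels[i] = -1              # provisionally noise
            continue
        cluster += 1                    # start a new cluster at a core point
        labels[i] = cluster
        seeds = list(neigh[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:         # noise reached from a core point
                labels[j] = cluster     # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neigh[j]) >= minpts: # core point: keep expanding
                seeds.extend(neigh[j])
    return labels

# Two tight groups plus one isolated point: as eps grows, noisy
# instances are absorbed and the clusters eventually merge into one.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (10.0, 10.0)]
for eps in (0.05, 0.2, 20.0):
    labels = dbscan(pts, eps, minpts=3)
    n_clusters = len({l for l in labels if l >= 0})
    n_noise = labels.count(-1)
```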
As stated in Sect. 3.2, the clustering results directly determine the choice of query neighbors. Therefore, the classification performance of DBANN is sensitive to \(eps\), which dominates the clustering situation.
5.1.2 \(k\) value
To analyze the influence of the \(k\) value, we fix \(eps\) at its optimal value and vary \(k\) as \(k = 1, 2, 3, \ldots, 60\) on five real-world datasets. The classification results in Fig. 5 show that on most datasets the classification performance drops as \(k\) increases. In particular, when \(k\) reaches 60, the performance drops to 0. This can be partly explained by the imbalance-handling strategy of DBANN. Essentially, DBANN does not generate additional synthetic instances to compensate for the minority class but instead increases the probability of selecting minority instances as query neighbors. However, when \(k\) is too large, the proportion of minority instances among the query neighbors cannot increase further even if the selection probability is 100%, due to the imbalanced distribution, which results in performance loss. Based on our experience, \(k = 3\) yields desirable performance.
5.1.3 Distribution of reliable query neighbors
In this section, we investigate the distribution of query neighbors through a series of experiments. We set \(k = 3\) and \(eps\) at its optimal value throughout. We define the ranking of each training instance by sorting the distances from the training instances to a query instance in ascending order, i.e., the nearest instance ranks 1.
We take the glass0 dataset as an example. We first run DBANN on glass0 with 10-fold cross-validation. In each fold, we record the rankings of the query neighbors for all query instances; after 10-fold cross-validation, we obtain the complete set of rankings. For example, with 90 training instances and 10 query instances in one fold, each run records the rankings of the three query neighbors of each query instance, giving 30 rankings; after 10-fold cross-validation, there are 300 rankings in total, which are used here for analysis. To facilitate observation, we calculate the proportion of rankings in different intervals (1st, 2nd–3rd, 4th–5th, 6th–7th, 8th–9th, 10th–20th percentile, 20th–100th percentile), and the result is shown in Fig. 6a. Obviously, unlike traditional kNN, the query neighbors of DBANN are no longer 100% the \(k\) nearest neighbors. In fact, the first three rankings (as in traditional kNN) account for only 13.83%, and the largest proportion of query neighbors falls in the 10th–20th percentile of all training instances. Noticeably, instances located far from the query instance (rankings in the 20th–100th percentile) can also be selected as query neighbors, although their proportion is only 2.19%. This partly shows that DBANN considers not only the local but also the global distribution when selecting query neighbors.
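The ranking computation used above can be sketched as follows (a minimal illustration with hypothetical points; in the experiments the selected neighbors come from DBANN's own selection mechanism):

```python
def neighbor_ranks(train_pts, query_pt, selected_idx):
    """Rank (1 = nearest) of each selected query neighbor among all
    training instances, by plain Euclidean distance to the query."""
    d = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    order = sorted(range(len(train_pts)),
                   key=lambda i: d(train_pts[i], query_pt))
    rank = {i: r + 1 for r, i in enumerate(order)}
    return [rank[i] for i in selected_idx]

# Hypothetical example: suppose instances 2 and 0 were selected as
# query neighbors; instance 2 is only the 3rd nearest to the query.
ranks = neighbor_ranks([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
                       (0.1, 0.0), [2, 0])  # -> [3, 1]
```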
To better understand the query neighbor selection mechanism, we expand the experiments to the 16 synthetic datasets (introduced in Sect. 4.1), which cover four overlapping levels as well as four imbalance levels. We define the k nearest neighbors of a query instance (the query neighbors of traditional kNN) as local neighbors and then analyze the proportion of local neighbors among all query neighbors of DBANN. A higher proportion indicates that decision making depends more on the local distribution; conversely, a lower proportion indicates that decision making relies more on the global distribution.
From the results in Fig. 6b, we can see that when the overlapping degree is rare, DBANN relies more on local neighbors, with a proportion of approximately 0.85. In contrast, when overlap is severe, the proportion of local neighbors is significantly lower on average. This partly shows our advantage in query neighbor selection: when the data distribution is difficult, DBANN adaptively expands the detection radius to search for more reliable instances even though they are located far away. In addition, Fig. 6c compares different imbalance degrees: on both rarely and severely imbalanced datasets, the proportion of local neighbors drops as the overlapping degree increases. Moreover, on severely imbalanced datasets, the proportion of local neighbors under moderate and severe overlap is lower than on rarely imbalanced datasets. These observations imply that DBANN can adaptively select query neighbors from the local to the global region in different scenarios.
5.1.4 Advantage of DBANN over other kNN-based methods
To further study the advantage of the query neighbor selection mechanism in DBANN, we compare DBANN with other kNN-based algorithms on two typical datasets, glass1 and yeast2vs4, which have different overlapping degrees and imbalance ratios, as a case study.
To better demonstrate this issue, we divide each dataset into an overlapping region and a non-overlapping region so as to take a closer look at the performance of each algorithm in different regions (whole region, overlapping region and non-overlapping region). Inspired by [66], we use kNN (k = 5) to separate the two regions. First, an instance is considered to be in the non-overlapping region if it and all of its 5 nearest neighbors belong to the same class; otherwise, it is considered to be in the overlapping region. Second, we calculate the imbalance ratio (IR) and Fisher's discriminant ratio (F1) in each region (Table 5). Finally, we run the 6 kNN-based algorithms on the two datasets and record their performance in each region (Table 6). It is worth noting that F1 in Table 5 denotes the overlapping degree, while F1 in Table 6 denotes the F-measure.
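The kNN-based region separation inspired by [66] can be sketched as follows (a minimal illustration; the toy one-dimensional data in the example are ours):

```python
def split_regions(X, y, k=5):
    """Mark each instance as overlapping (True: at least one of its k
    nearest neighbors carries a different label) or non-overlapping."""
    d = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    overlap = []
    for i, xi in enumerate(X):
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: d(X[j], xi))
        overlap.append(any(y[j] != y[i] for j in order[:k]))
    return overlap

# Two well-separated class clusters: every instance is non-overlapping.
Xs = [(i / 10,) for i in range(6)] + [(10 + i / 10,) for i in range(6)]
ys = [0] * 6 + [1] * 6
regions = split_regions(Xs, ys)
```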
By observing Table 5, we note that the local distribution differs between regions: the imbalance ratio is close to 1 in the overlapping region, while it is much higher in the non-overlapping region. The overlapping degree (F1) is higher in the overlapping region than in the other regions. All these results support the previous conclusion that the distribution in the overlapping region is complex and hard to learn. This conclusion is also confirmed when we compare the performance of all kNN-based algorithms on the two datasets in Table 6, in which the F1 and GM values in the overlapping region are significantly lower than those in the non-overlapping region as well as the whole region. However, it is worth noting that DBANN performs better than the other algorithms in the overlapping region on both datasets. On glass1, DBANN achieves the best results (F1: 72.50, GM: 45.20) in the overlapping region although its overall result (F1: 76.23, GM: 79.72) only ranks fourth among all the algorithms. Meanwhile, the performance of DBANN in the non-overlapping region does not drop significantly compared with the other algorithms. The same situation also occurs on yeast2vs4. These results indicate that DBANN is able to excel in the overlapping region at the cost of a small loss in the non-overlapping region. We regard this property as the main advantage of DBANN over the remaining algorithms, and we believe it derives from the adaptive query neighbor selection mechanism, which is sensitive to variations in the local distribution.
5.2 Overall performance of DBANN
5.2.1 Performance on synthetic datasets
In this part, we validate the effectiveness of our proposed method through a series of experiments. Tables 7 and 8 show the comparison of DBANN with kNN-based methods in terms of F1 and GM on the 16 synthetic datasets. The optimal result on each dataset is highlighted in boldface. It can be seen that DBANN performs better than the other methods on almost all datasets in terms of the average rank in F1 and GM. In particular, when the data distribution is severely overlapping, i.e., on datasets A1–A4, DBANN obtains the best average rank in both F1 (1.50) and GM (2.25). This implies the advantage of the query neighbor selection mechanism in the face of extremely difficult data distributions. When the overlapping degree is moderate or slight (B1–B4, C1–C4), DBANN obtains the optimal results on all datasets except for GM on B1 and C1. As for the imbalance issue, we observe that DBANN performs better at high imbalance ratios (1:9, 1:19), with an average rank of 1 in GM, while its average rank on the low-imbalance-ratio datasets (1:2, 1:4) is 2.5. This demonstrates that DBANN has the ability to handle highly imbalanced distributions.
Moreover, to analyze the statistical significance of differences among the comparative methods, the Friedman test (FR) is carried out. According to the F-distribution, the critical value is \(q_{\beta } = F\left( {0.05, 5, 5 \times 15} \right) = 2.9013\). From the results in Table 9, we can see that \(F_{F} > q_{\beta }\) in both F1 and GM, which indicates that significant differences exist among the compared methods. Subsequently, pairwise comparisons are conducted by the Bonferroni–Dunn test. The critical value \(q_{\gamma }\) for the two-tailed Bonferroni–Dunn test (\(\alpha = 0.05\)) with 6 algorithms is 2.576 [63]. We highlight in boldface the algorithms that are significantly different from DBANN. Concretely, differences exist for W-kNN, kNN, F-kNN and H-kNN in F1, and for W-kNN, kNN and H-kNN in GM. Additionally, we notice that DBANN appears statistically similar to kRNN with regard to F1 and GM. However, the targets and structures of the two algorithms are totally different: kRNN biases the posterior probability estimation toward the minority class based on the local distribution to handle the imbalanced problem, whereas DBANN aims to boost performance by searching for reliable query neighbors in both the local and global distribution while additionally considering the overlapping issue.
5.2.2 Performance on real-world datasets
In this section, we compare DBANN with kNN-based methods as well as generality-oriented methods on 41 real-world datasets. Tables 10 and 11 show that the average ranks of DBANN in GM and F1 are 2.3902 (1st) and 1.9268 (1st), respectively, indicating that DBANN achieves better performance than the other kNN-based methods. To gain clearer insight into the behavior of DBANN, we analyze the results under different distributions by means of a statistical study. First, regarding the overlapping issue (high overlapping degree: F1 < 1.6; low overlapping degree: F1 ≥ 1.6), we note that the average ranks of DBANN are 2.15 and 2.2 with respect to F1 and GM on the high overlapping datasets, while the average ranks on the low overlapping datasets are 2.61 and 2.68, respectively. These results support the ability of DBANN in the face of highly overlapping distributions. Moreover, when high overlap and high imbalance co-occur, i.e., IR > 20 and F1 < 1.6, DBANN still outperforms most of the other methods. In particular, on the datasets yeast1458vs7, yeast1289vs7, winequalityred3vs5, yeast2vs8 and yeast4, DBANN obtains the optimal results. This good behavior is due to the query neighbor selection mechanism of DBANN, which provides query neighbors with more reliable information when the minority class is scarce and the distribution is overlapping. Again, we implement the Friedman test (FR) and the Bonferroni–Dunn test (BD) (\(q_{\beta } = F\left( {0.05, 5, 5 \times 40} \right) = 2.45\), \(q_{\gamma } = 2.576\)) on the real-world datasets and find that DBANN presents significant differences from W-kNN and F-kNN in F1, and from W-kNN, kNN, F-kNN and H-kNN in GM among the kNN-based methods, as shown in Table 12. As for the generality-oriented methods, DBANN also achieves superior performance, as listed in Tables 13 and 14. In particular, in terms of F1, DBANN obtains the smallest average rank, 2.2561, which is superior to the second-best rank, 3.4146, by a large margin.
Likewise, the significance test results are listed in Table 15, which indicate that differences exist between DBANN and most of the algorithms except for HSB and SVM + SMOTE in GM, and KDE in F1.
6 Conclusions
In this study, we propose a novel method, DBANN, to deal with both imbalanced and overlapping problems. The main idea of DBANN is to find the most reliable query neighbors using density-based methods. We first divide the training data into six parts by DBSCAN, and then in each part we assign a reliability degree to instances based on density, class imbalance and the overlapping situation. Afterward, we adjust the distance metric according to the reliability degree to make reliable instances more likely to be selected as query neighbors. Finally, the output is determined by the reliable query neighbors.
Different from existing kNN-based methods, DBANN takes advantage of both local and global information in query neighbor selection. Additionally, the noise factor is also considered in DBANN to boost classification performance. It is worth noting that the query neighbors in our method adapt according to the data distribution. To validate the effectiveness of DBANN, we conduct experiments on both synthetic and real-world datasets. The results show that our method outperforms kNN-based methods as well as generality-oriented methods in terms of F1 and GM.
Further research is required to extend DBANN to multi-class classification problems. Moreover, we also plan to implement other density-based clustering methods within the framework of DBANN. Besides, it would be interesting to set up a dedicated public dataset repository for comparing algorithms on overlapping problems.
References
Qiwei H, Chakhar S, Siraj S, Labib A (2017) Spare parts classification in industrial manufacturing using the dominance-based rough set approach. Eur J Oper Res 262(3):1136–1163
Li Z, Wang Y, Wang K (2019) A deep learning driven method for fault classification and degradation assessment in mechanical equipment. Comput Ind 104:1–10
Lei K, Xie Y, Zhong S, Dai J, Yang M, Shen Y (2019) Generative adversarial fusion network for class imbalance credit scoring. Neural Comput Appl 32:8451–8462
Villuendas-Rey Y, Rey-Benguría CF, Ferreira-Santiago Á, Camacho-Nieto O, Yáñez-Márquez C (2017) The naïve associative classifier (NAC): a novel, simple, transparent, and accurate classification model evaluated on financial data. Neurocomputing 265:105–115
Shoaran M, Haghi BA, Taghavi M, Farivar M, Emami-Neyestanak A (2018) Energy-efficient classification for resource-constrained biomedical applications. IEEE J Emerg Sel Top Circuits Syst 8(4):693–707
Lowrance CJ, Lauf AP (2019) An active and incremental learning framework for the online prediction of link quality in robot networks. Eng Appl Artif Intell 77:197–211
Guo H, Li Y, Shang J, Mingyun G, Huang Y, Bing G (2017) Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 73:220–239
Nekooeimehr I, Lai-Yuen SK (2016) Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Syst Appl 46:405–416
Jian C, Gao J, Ao Y (2016) A new sampling method for classifying imbalanced data based on support vector machine ensemble. Neurocomputing 193:115–122
Raj V, Magg S, Wermter S (2016) Towards effective classification of imbalanced data with convolutional neural networks. In: IAPR workshop on artificial neural networks in pattern recognition. Springer, pp 150–162
Khan SH, Hayat M, Bennamoun M, Sohel FA, Togneri R (2018) Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans Neural Netw Learn Syst 29(8):3573–3587
García S, Zhang Z-L, Altalhi A, Alshomrani S, Herrera F (2018) Dynamic ensemble selection for multiclass imbalanced datasets. Inf Sci 445:22–37
Zhang Z, Krawczyk B, Garcìa S, Rosales-Pérez A, Herrera F (2016) Empowering one-vs-one decomposition with ensemble learning for multi-class imbalanced data. Knowl Based Syst 106(C):251–263
Zhang ZL, Luo XG, González S, García S, Herrera F (2018) DRCW-ASEG: one-versus-one distance-based relative competence weighting with adaptive synthetic example generation for multi-class imbalanced datasets. Neurocomputing 285(12):176–187
Denil M, Trappenberg T (2010) Overlap versus imbalance. In: Canadian conference on artificial intelligence. Springer, pp 220–231
Tang Y, Gao J (2007) Improved classification for problem involving overlapping patterns. IEICE Trans Inf Syst 90(11):1787–1795
Peng P, Wang J (2019) Wear particle classification considering particle overlapping. Wear 422(423):119–127
Liu CL (2006) Artificial neural networks in pattern recognition. In: Second IAPR workshop on artificial neural networks in pattern recognition (ANNPR 2006), pp 37–146
Chowdhury SA, Stepanov EA, Danieli M et al (2019) Automatic classification of speech overlaps: feature representation and algorithms. Comput Speech Lang 55:145–167
Podder A, Latha N (2017) Data on overlapping brain disorders and emerging drug targets in human Dopamine Receptors Interaction Network. Data Br 12:277–286
López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141
García V, Sánchez J, Mollineda R (2007) An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. In: Iberoamerican congress on pattern recognition. Springer, pp 397–406
Prati RC, Batista GE, Monard MC (2004) Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Mexican international conference on artificial intelligence. Springer, pp 312–321
Yu Q, Hongye S, Guo L, Chu J (2011) A novel svm modeling approach for highly imbalanced and overlapping classification. Intell Data Anal 15(3):319–341
Alejo R, Valdovinos RM, García V, Horacio Pacheco-Sanchez J (2013) A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios. Pattern Recogn Lett 34(4):380–388
Wasikowski M, Chen X (2010) Combating the small sample class imbalance problem using feature selection. IEEE Trans Knowl Data Eng 22(10):1388–1400
Xia S-Y, Xiong Z-Y, He Y, Li K, Dong L-M, Zhang M (2014) Relative density-based classification noise detection. Optik Int J Light Electron Opt 125(22):6829–6834
Sáez JA, Luengo J, Stefanowski J, Herrera F (2015) SMOTE–IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203
Orriols-Puig A, Bernadó-Mansilla E, Goldberg DE, Sastry K, Lanzi PL (2009) Facetwise analysis of XCS for problems with class imbalances. IEEE Trans Evol Comput 13(5):1093–1119
Prati RC, Batista GE, Monard MC (2004) Learning with class skews and small disjuncts. In: Brazilian symposium on artificial intelligence. Springer, pp 296–306
Adams N (2010) Dataset shift in machine learning. J R Stat Soc Ser A (Stat Soc) 173(1):274
Subbaswamy A, Saria S (2018) Counterfactual normalization: proactively addressing dataset shift and improving reliability using causal mechanisms. arXiv preprint arXiv:1808.03253
Ho TK, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):1–300
Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29
Fernández A, del Jesus MJ, Herrera F (2015) Addressing overlapping in classification with imbalanced datasets: a first multi-objective approach for feature and instance selection. In: International conference on intelligent data engineering and automated learning. Springer, pp 36–44
Alshomrani S, Bawakid A, Shim S-O, Fernández A, Herrera F (2015) A proposal for evolutionary fuzzy systems using feature weighting: dealing with overlapping in imbalanced datasets. Knowl Based Syst 73:1–17
Xiong H, Wu J, Liu L (2010) Classification with class overlapping: a systematic study. In: Proceedings of the 1st international conference on E-business intelligence (ICEBI2010). Atlantis Press
Vorraboot P, Rasmequan S, Chinnasarn K, Lursinsap C (2015) Improving classification rate constrained to imbalanced data between overlapped and non-overlapped regions by hybrid algorithms. Neurocomputing 152:429–443
Weiss GM (2004) Mining with rarity: a unifying framework. ACM SIGKDD Explor Newsl 6(1):7–19
Vuttipittayamongkol P, Elyan E, Petrovski A, Jayne C (2018) Overlap-based undersampling for improving imbalanced data classification. In: International conference on intelligent data engineering and automated learning. Springer, Cham, 2018
Liu N, Xing X, Li Y, Zhu A (2019) Sparse representation based image super-resolution on the knn based dictionaries. Opt Laser Technol 110:135–144
Kuzhali SE, Suresh DS (2018) Patch-based denoising with k-nearest neighbor and SVD for microarray images. In: Computer science on-line conference. Springer, pp 132–147
Kriminger E, Principe JC, Lakshminarayan C (2012) Nearest neighbor distributions for imbalanced classification. In: The 2012 international joint conference on neural networks (IJCNN). IEEE, pp 1–5
García V, Mollineda RA, Sánchez JS (2008) On the k-nn performance in a challenging scenario of imbalance and overlapping. Pattern Anal Appl 11(3–4):269–280
Dubey H, Pudi V (2013) Class based weighted k-nearest neighbor over imbalance dataset. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 305–316
Harshita P, Thakur GS (2016) A hybrid weighted nearest neighbor approach to mine imbalanced data. In: Proceedings of the international conference on data mining (DMIN). The Steering Committee of the World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), p 106
Harshita P, Thakur GS (2018) An improved fuzzy K-nearest neighbor algorithm for imbalanced data using adaptive approach. IETE J Res 2018:1–10
Zhang X, Li Y (2011) A positive-biased nearest neighbor algorithm for imbalanced classification. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 293–304
Zhang X, Li Y, Kotagiri R, Lifang W, Tari Z, Cheriet M (2017) k rare-class nearest neighbor classification. Pattern Recogn 62:33–44
Mullick SS, Datta S, Das S (2018) Adaptive learning-based k-nearest neighbor classifiers with resilience to class imbalance. IEEE Trans Neural Netw Learn Syst 99:1–13
Wang J, Neskovic P, Cooper LN (2007) Improving nearest neighbor rule with a simple adaptive distance measure. Pattern Recogn Lett 28(2):207–213
İnkaya T (2015) A density and connectivity based decision rule for pattern classification. Expert Syst Appl 42(2):906–912
Van Hulse J, Khoshgoftaar TM, Napolitano A (2010) A novel noise filtering algorithm for imbalanced data. In: 2010 9th international conference on machine learning and applications. IEEE, pp 9–14
Kang Q, Chen XS, Li S, Zhou M (2017) A noise filtered under-sampling scheme for imbalanced classification. IEEE Trans Cybern 47(12):4263–4274
Schubert E, Sander J, Ester M, Kriegel HP, Xiaowei X (2017) Dbscan revisited, revisited: why and how you should (still) use dbscan. ACM Trans Database Syst (TODS) 42(3):19
Czerniawski T, Sankaran B, Nahangi M, Haas C, Leite F (2017) 6D DBSCAN-based segmentation of building point clouds for planar object classification. Autom Constr 88:44–58
Das B, Krishnan NC, Cook DJ (2014) Handling imbalanced and overlapping classes in smart environments prompting dataset. In: Yada K (ed) Data mining for service. Springer, Berlin, pp 199–219
Alcalafdez J, Sánchez L, García S, Del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J, Rivas VM (2009) KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13(3):307–318
Chawla NV, Bowyer KW, Hall LO, Philip Kegelmeyer W (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Zhang J, Shi H (2019) Kd-tree based efficient ensemble classification algorithm for imbalanced learning. In: 2019 international conference on machine learning, big data and business intelligence (MLBDBI), pp 203–207
Lu Y, Cheung YM, Tang YY (2016) Hybrid sampling with bagging for class imbalance learning. In: Pacific-Asia conference on knowledge discovery and data mining. Springer International Publishing
Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2010) RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A Syst Hum 40(1):185–197
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Commun Stat Theory Methods 9(6):571–595
Ester M, Kriegel H-P, Sander J, Xu X et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96:226–231
Bader-El-Den M, Teitei E, Perry T (2019) Biased random forest for dealing with the class imbalance problem. IEEE Trans Neural Netw Learn Syst 30(7):2163–2172
Acknowledgements
The authors would like to thank the editor and reviewers for their useful comments and suggestions, which are of great help in improving the quality of the paper. This work is financially supported by the National Science Foundation of China (NSFC Proj. 71831006, 71801065, and 71771070), the Zhejiang Provincial Natural Science Foundation of China under Grant No. LZ20G010001 and the Promotion China Ph.D. Program from BMW Brilliance Automotive.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Yuan, BW., Luo, XG., Zhang, ZL. et al. A novel density-based adaptive k nearest neighbor method for dealing with overlapping problem in imbalanced datasets. Neural Comput & Applic 33, 4457–4481 (2021). https://doi.org/10.1007/s00521-020-05256-0