1 Introduction

Imbalanced classification, which arises in many scenarios including industrial manufacturing [1, 2], financial management [3, 4], biomedical engineering [5] and information technology [6], is one of the critical issues in machine learning and data mining. An imbalanced dataset has a skewed class distribution, i.e., the instances of one class outnumber those of the other classes. Most standard classifiers are biased toward the majority class, which leads to a high misclassification rate on minority instances. The imbalanced problem is therefore commonly viewed as a major challenge to classification and has attracted great attention [7].

Existing solutions to the imbalanced problem can be roughly grouped into four categories: resampling techniques, algorithm modification methods, cost-sensitive learning approaches and ensemble learning methods.

  • Resampling techniques rebalance the training dataset, generating an approximately balanced class distribution that is suitable for standard classifiers [8, 9].

  • Algorithm modification methods try to adjust the structure of standard classifiers to diminish the effect caused by class imbalance [10].

  • Cost-sensitive learning approaches generally assign a higher misclassification cost to the minority class to compensate for the scarcity of minority data [11].

  • Ensemble learning methods were originally developed to enhance classification ability by combining multiple single classifiers; researchers have since modified ensemble algorithms to adapt them to the imbalanced problem, with promising results [12,13,14].

In addition to the imbalanced problem, overlapping between classes is recognized as another factor that degrades learning performance [15, 16]. Overlap appears when a region contains almost equal numbers of instances from different classes. This situation results in roughly equal prior probabilities for the classes and thus poses a strong handicap for classification. The overlapping problem is pervasive in many real-world applications such as fault diagnosis [17], character recognition [18], speech classification [19] and drug design [20]. In these scenarios, instances from different classes often have similar characteristics in the feature space. For example, in character recognition, the letters ‘O’ and ‘o’ and the numeral ‘0’ have almost identical shapes, which creates an overlapping region in the feature space and makes them hard to separate. Previous investigations have shown that overlap degrades classification performance even more severely than imbalance [21]. To clearly present the relationship between the two factors, a series of experiments has been conducted by varying the degrees of imbalance and overlap in the dataset. The conclusions state that learning algorithms can yield competitive performance when a dataset has a low overlapping degree combined with a high imbalance ratio, but they can hardly achieve desirable results when the overlapping degree is high, even if the imbalance ratio is low. This demonstrates that overlap is the main factor in classification degradation [21,22,23]. Furthermore, Denil and Trappenberg [15] took the size of the dataset into consideration, and their study revealed that in small datasets the learning process is hindered by imbalance and overlap individually, whereas when training data are sufficient, the two factors interact to jeopardize the learning performance.

Currently, most studies deal with the overlapping and imbalanced problems separately. In practical applications, however, the overlapping problem frequently occurs in imbalanced data, which poses an even greater challenge to classification. Although a few papers attempt to consider both factors as a whole [24, 25], the structures of the related algorithms are complex to implement. To fill this gap, we propose a density-based adaptive nearest neighbor method (DBANN) that deals with these two problems simultaneously with a simple structure. The main idea of DBANN is an adaptive distance adjustment strategy devoted to defining and exploiting reliable query neighbors. To the best of our knowledge, our approach is the first kNN-based method that aims to combat both the imbalanced and overlapping problems. The main contributions of our study can be summarized as follows:

  • We propose a density-based kNN method named DBANN which can handle imbalanced and overlapping problems simultaneously.

  • To enhance classification ability, we develop a distance adjustment strategy using density-based methods to adaptively find out the most reliable query neighbors.

  • To validate the effectiveness of DBANN, we compare it with other state-of-the-art methods on 16 synthetic datasets and 41 real-world datasets, respectively.

The remainder of this paper is organized as follows. Related works are described in Sect. 2. In Sect. 3, we introduce the proposed DBANN method. Section 4 presents extensive experiments on both synthetic and real-world datasets. Results and discussion are shown in Sect. 5. Finally, conclusions are presented in Sect. 6.

2 Related works

2.1 Overlapping problem in imbalanced datasets

In binary classification, imbalanced data refer to a distribution in which instances of one class outnumber those of the other, as shown in Fig. 1a. In this situation, the minority class is hard to recognize by standard classifiers, which prefer a good coverage of the majority class in order to achieve desirable global performance. However, in real applications, the minority class often contains the critical information we need, such as scrap parts among all products or patients among the whole population. Therefore, it is imperative to gain a deep insight into the intrinsic characteristics of imbalanced data. With this in mind, we note that imbalanced data do not hinder learning ability on their own but in combination with other factors, such as dataset size [21, 26], noise [27, 28], small disjuncts [29, 30] and data shift [31, 32].

Fig. 1
figure 1

Examples of imbalanced and overlapping distribution

Overlap occurs in a region where both classes co-exist and is viewed as one of the main obstacles to classification [15], as depicted in Fig. 1b. In such a region, the probability of each class is approximately equal, which gives rise to a high misclassification rate. In order to quantify the overlapping degree for an individual feature dimension, Ho and Basu [33] proposed a metric called the maximum Fisher’s discriminant ratio (F1), as shown in Eq. (1), where \(\mu_{1} ,\mu_{2} ,\sigma_{1} ,\sigma_{2}\) indicate the means and variances of the two classes, respectively. For a multidimensional dataset, the maximal \(f\) among all the features is defined as F1. Datasets with a low value of F1 have a high degree of overlap, and vice versa. To overcome the overlapping problem, some researchers change the data distribution before modeling. Tang [16] first transformed the original data into a more separable distribution through overlapping pattern extraction and rough set theory, and then applied the proposed DR-SVM to the transformed data. Batista et al. [34] used data cleaning techniques to cope with highly overlapping data and achieved desirable results. Similarly, in order to better prepare the data for classification, other pre-processing methods such as data selection and feature selection are proposed in [35, 36]. However, pre-processing methods may involve the risk of noise introduction or information loss. Xiong et al. [37] found that modeling the overlapping and non-overlapping regions separately is a promising scheme for solving the class overlapping problem. Following this line of thinking, Vorraboot et al. [38] first partitioned the training data into a non-overlapping region, a borderline region and an overlapping region; different techniques were then employed for the different regions, and finally the outputs of all techniques were combined. Nevertheless, that study is only suitable for two Gaussian classes with independent and identical distributions.

$$f = \frac{{\left( {\mu_{1} - \mu_{2} } \right)^{2} }}{{\sigma_{1}^{2} + \sigma_{2}^{2} }}$$
(1)
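To make Eq. (1) concrete, the following sketch (our own illustration, not code from [33]) computes \(f\) for every feature of a two-class dataset and returns the maximum as F1; the small constant added to the denominator is an assumption to guard against zero variance.

```python
import numpy as np

def max_fisher_discriminant_ratio(X_majority, X_minority):
    """Maximum Fisher's discriminant ratio (F1) for a binary dataset.

    X_majority, X_minority: arrays of shape (n_maj, m) and (n_min, m).
    A low F1 indicates a high degree of class overlap.
    """
    mu1, mu2 = X_majority.mean(axis=0), X_minority.mean(axis=0)
    var1, var2 = X_majority.var(axis=0), X_minority.var(axis=0)
    f = (mu1 - mu2) ** 2 / (var1 + var2 + 1e-12)   # Eq. (1), one value per feature
    return f.max()                                  # F1 = maximal f over all features
```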

In real-world applications, the overlapping and imbalanced problems frequently co-exist in a dataset, but this combination is often ignored in previous studies. Figure 1c marks two such overlapping regions with circles. Compared with Fig. 1b, the overlapping regions here exhibit comparatively high density from the perspective of the global distribution. This is partly attributed to the sparsity of the minority class, which in turn accentuates the compactness of the two overlapping regions. Additionally, the imbalance ratio differs between the overlapping region and the other regions. It is worth noting that for learning algorithms based on a divide-and-conquer strategy [39], the variation of class distribution across regions may threaten classification performance. Besides, the overlapping and imbalanced problems also influence algorithms that are sensitive to data density, such as k nearest neighbors.

To address the overlapping and imbalanced problems, Alejo et al. [25] developed a hybrid method which combines a modified back propagation (MBP) with a Gabriel graph editing technique (GGE); MBP copes with the imbalance issue, and GGE is responsible for the overlapping problem. Vuttipittayamongkol et al. [40] proposed an overlap-based under-sampling method (OBU). By eliminating majority instances from the overlapping region, OBU improves the visibility of minority instances. So far, however, such studies remain scarce.

2.2 kNN-based methods for dealing with overlapping or imbalanced datasets

kNN is a typical non-parametric approach that is widely applied in diverse domains due to its simple but powerful decision rule [41, 42]. However, when encountering an imbalanced class distribution, kNN tends to lose its ability to yield competitive results [43, 44]. To cope with this, kNN has been modified to incorporate immunity against the influence of imbalance. Concretely, Kriminger et al. [43] proposed a class conditional nearest neighbor distribution algorithm (CCNND). To mitigate the effect of the imbalanced distribution, for each class, CCNND counts the training instances that satisfy a specified distance condition within the k nearest neighbors of a query instance. Afterward, an empirical cumulative distribution function (CDF) is built and the probability of each class is computed. Dubey and Pudi [45] provided a weighting scheme (hereafter called W-kNN) to address the imbalance issue. W-kNN assigns a weight to each class based on the misclassification rate obtained by traditional kNN. Patel [46] developed a hybrid weighted strategy (hereafter called H-kNN). Its main advantage is the use of a dynamic k value, i.e., a small k for the minority class and a large k for the majority class; therefore, H-kNN better mines the information in an imbalanced distribution. On this basis, the same author took fuzzy rule-based classification into consideration and proposed an improved fuzzy k-nearest neighbor (hereafter called F-kNN) [47]. Based on fuzzy membership, the query instance knows in advance to what degree its neighbors belong to each class. Zhang and Li [48] presented a minority-biased nearest neighbor algorithm called PNN. In order to handle the inappropriate probability estimation for the minority class, PNN fixes the number of minority query neighbors. For example, m-PNN means that there must be m minority instances among the query neighbors; the number of query neighbors therefore changes dynamically to ensure enough instances for the probability estimation of both classes. The k rare-class nearest neighbor classification (kRNN) [49] boosts PNN by updating the dynamic query neighbor strategy. The new strategy reinforces the analysis of the distributions around query instances, so kRNN can handle not only inter-class imbalance but also within-class imbalance. Mullick et al. [50] proposed an adaptive learning kNN method called Ada-kNN. It uses a class-based global imbalance handling scheme (GIHS) to compensate for the scarcity of minority data. To assign a global weight to each class, GIHS considers both the ideal class probability (balanced distribution) and the actual class probability (imbalanced distribution).

The overlapping problem is another obstacle for learning algorithms, and kNN is no exception, as stated in Sect. 2.1. Garcia et al. [44] investigated the behavior of kNN when overlap exists in imbalanced data. The results reveal that when the imbalance ratio in the overlapping region equals the global imbalance ratio, i.e., the majority class dominates the overlapping region, the true positive rate (TPR) drops as the overlapping degree increases. Conversely, when the minority class becomes the most represented class in the overlapping region, the TPR increases instead. Additionally, they pointed out that the imbalance ratio in the overlapping region matters more than the size of the overlapping region and the global imbalance ratio. Wang et al. [51] proposed an extremely simple but well-performing algorithm called A-kNN. It aims to form reliable query neighbors for the final decision. To do so, A-kNN modifies the distance metric to move reliable instances closer to the query instance. Hence, even if the query instance is located in the overlapping region, which is viewed as an ambiguous and untrusted area, the query neighbors selected after distance adjustment are reliable. Although A-kNN is an effective solution to the overlapping problem, it cannot handle the imbalance issue.

Even though some efforts have been made to enhance the classification performance of kNN, several drawbacks remain to be addressed. First, previous kNN-based methods treat the imbalanced and overlapping problems separately, although the two factors usually co-exist in real-world applications. Second, some modified kNN methods choose pivot instances as query neighbors for decision making; however, the selection criterion they use considers either global or local information alone. When data density varies considerably among regions, especially when the imbalance ratio differs significantly among these regions, such a selection criterion can no longer work effectively. Finally, previous works ignore the influence of noisy instances, which can significantly jeopardize classification performance. In the following sections, we address the above concerns with a novel method.

3 Combating overlapping and imbalanced problems using density-based adaptive k nearest neighbor method (DBANN)

In this section, we aim to conquer the overlapping and imbalanced problems with a density-based strategy. For ease of discussion, we focus on the binary-class problem, although the method can be generalized to the multi-class case. Consider a given training dataset \(D\) with \(N\) instances, \(D = \left\{ {\left( {x_{1} ,y_{1} } \right),\left( {x_{2} ,y_{2} } \right), \ldots ,\left( {x_{N} ,y_{N} } \right)} \right\}\), where \(x_{i} \in X\) is a training instance in \(m\) dimensions and \(x\) is a query instance. Traditional kNN first determines the \(k\) nearest neighbors of the query instance based on the Euclidean distance in Eq. (2); then, the majority vote in Eq. (3) is used to classify \(x\). In addition, \({\text{IR}}\) in Eq. (4) indicates the imbalance ratio of a dataset, where \(N_{{\rm minority}}\) and \(N_{{\rm majority}}\) refer to the numbers of minority and majority instances, respectively. In subsequent sections, \({\text{IR}}_{{\rm global}}\) denotes the imbalance ratio of the whole dataset and \({\text{IR}}_{{\rm local}}\) the imbalance ratio in a specified region.

$$d\left( {x,x_{i} } \right) = \left( {\sum\limits_{j = 1}^{m} {\left| {x^{j} - x_{i}^{j} } \right|^{2} } } \right)^{1/2}$$
(2)
$$f\left( x \right) = {\text{sgn}} \left( {\sum\limits_{i = 1}^{k} {y_{i} } } \right)$$
(3)
$${\text{IR}} = \frac{{N_{{\rm majority}} }}{{N_{{\rm minority}} }}$$
(4)
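For reference, a minimal sketch of the baseline decision rule in Eqs. (2)–(4) is given below; it assumes class labels are coded as +1 for the minority class and −1 for the majority class (a convention we adopt for illustration only).

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Traditional kNN: Euclidean distance (Eq. 2) + sign of the label sum (Eq. 3)."""
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))   # Eq. (2)
    neighbors = np.argsort(d)[:k]                          # k nearest neighbors
    return np.sign(y_train[neighbors].sum())               # Eq. (3), majority vote

def imbalance_ratio(y_train):
    """IR = N_majority / N_minority (Eq. 4)."""
    return (y_train == -1).sum() / float((y_train == 1).sum())
```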

3.1 Description of reliable query neighbors

Previous studies have shown the significance of query neighbors in kNN classification [48, 52]. Inspired by this, the main idea of DBANN is to seek reliable query neighbors to be used for the majority vote. Different from traditional kNN, DBANN selects query neighbors depending not only on distance but also on the data characteristics, i.e., the imbalanced and overlapping distributions in local and global regions. Besides, we expect the query neighbors to change adaptively with different degrees of imbalance and overlap. Therefore, we first describe the characteristics of reliable query neighbors as follows:

  • The query neighbors should be instances located near the query instance so as to include representative information.

  • Due to the scarcity of minority data, the query neighbors should be biased toward the minority class.

  • Since instances in the overlapping region are hard to separate, these instances should be viewed as unreliable and their probability of being selected as query neighbors should be lowered.

  • In most cases, noisy instances are an obstacle for classification [28, 53, 54]. Hence, they should be avoided when selecting query neighbors.

  • Since kNN has proved to be sensitive to data complexity and class density [44], it is desirable to take density and class distribution into consideration.

3.2 A density-based adaptive k nearest neighbor method (DBANN)

Most previous research on kNN methods concentrates on addressing the imbalanced problem but overlooks the influence of overlap. The literature [51] proposes an adaptive kNN method (A-kNN) to deal with the overlapping problem. Different from traditional kNN, A-kNN modifies the distance metric according to the overlapping degree. First, for each training instance \(x_{i}\), A-kNN computes a reliability coefficient \(r_{i}\), defined as the distance from \(x_{i}\) to the nearest training instance \(x_{j}\) belonging to a different class, as listed in Eq. (5). A lower \(r_{i}\) value means that \(x_{i}\) and \(x_{j}\) are located closer to each other, which implies a high overlapping degree in this region; therefore, \(x_{i}\) is viewed as an unreliable instance, and vice versa. In other words, the \(r_{i}\) value measures the reliability of \(x_{i}\): a high \(r_{i}\) value implies that \(x_{i}\) is reliable and helpful in classification, whereas a low \(r_{i}\) value indicates that \(x_{i}\) is unreliable. After obtaining \(r_{i}\) for each training instance, A-kNN adjusts the distance metric by Eq. (6). Finally, the output is obtained by Eq. (3). Technically speaking, A-kNN is a local method for handling the overlapping problem; however, the imbalance issue is not considered.

$$r_{i} = \mathop {\min }\limits_{{j:y_{j} \ne y_{i} }} d\left( {x_{i} ,x_{j} } \right)$$
(5)
$$d_{{\rm new}} \left( {x,x_{i} } \right) = \frac{{d\left( {x,x_{i} } \right)}}{{r_{i} }}$$
(6)
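A brief sketch of this adjustment, written from Eqs. (5) and (6) rather than taken from the original A-kNN implementation, is:

```python
import numpy as np

def reliability_coefficients(X_train, y_train):
    """Eq. (5): r_i = distance from x_i to its nearest neighbor of the opposite class."""
    r = np.empty(len(X_train))
    for i in range(len(X_train)):
        opposite = X_train[y_train != y_train[i]]
        r[i] = np.sqrt(((opposite - X_train[i]) ** 2).sum(axis=1)).min()
    return r

def adjusted_distances(X_train, r, x_query):
    """Eq. (6): d_new(x, x_i) = d(x, x_i) / r_i; a large r_i pulls x_i closer to x."""
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    return d / r
```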

In this study, we extend the concept of \(r_{i}\) to handle both imbalanced and overlapping problems.

Concretely, in the first step, we partition the training data into several clusters and noisy instances using a density-based method (introduced in Sect. 3.3). To further characterize the clusters, we consider the overlapping issue and divide the clusters into two types, defined as follows.

Definition 1

An overlapping cluster is a cluster that contains both majority and minority instances.

Definition 2

A non-overlapping cluster is a cluster that contains only majority or only minority instances.

Therefore, after clustering, the training data can be divided into six parts: (a) minority noisy instances, (b) majority noisy instances, (c) majority instances in overlapping clusters, (d) minority instances in overlapping clusters, (e) majority instances in non-overlapping clusters and (f) minority instances in non-overlapping clusters. Figure 2 depicts the six parts in detail.

Fig. 2
figure 2

Description of DBANN method

Afterward, in the second step, we assign a reliability coefficient \(r_{i}\) to each training instance \(x_{i}\). Different from A-kNN, we capture the distribution variation and take the noise factor into consideration. Specifically, we assign \(r_{i}\) to the training instances in each part as follows (a code sketch of this assignment step is given after the list):

  (a) For minority noisy instances, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the nearest majority neighbor.

  (b) For majority noisy instances, \(r_{i}\) is assigned as a small random positive value.

  (c) For minority instances in an overlapping cluster, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the \({\text{IR}}_{{\rm local}}\)th nearest majority neighbor, where \({\text{IR}}_{{\rm local}}\) equals the imbalance ratio of the corresponding cluster and thus represents the local distribution. Obviously, a high imbalance ratio expands the detection radius of \(x_{i}\) and accordingly yields a larger \(r_{i}\). Notably, \(r_{i}\) also depends on the density of the cluster: a high-density cluster packs many instances into a small region, so even when the detection radius expands to the \({\text{IR}}_{{\rm local}}\)th nearest majority neighbor, \(r_{i}\) may increase by only a small amount compared with low-density clusters.

  (d) For majority instances in an overlapping cluster, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the nearest minority neighbor.

  (e) For minority instances in a non-overlapping cluster, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the \({\text{IR}}_{{\rm global}}\)th nearest majority neighbor. Different from \({\text{IR}}_{{\rm local}}\) in the overlapping case, \({\text{IR}}_{{\rm global}}\) here is the global imbalance ratio.

  (f) For majority instances in a non-overlapping cluster, \(r_{i}\) is also assigned as the distance from \(x_{i}\) to the nearest minority neighbor.
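The following sketch shows one way rules (a)–(f) could be coded; it is our own reading of the procedure, not the authors' released implementation. It assumes labels are +1 (minority) and −1 (majority) and that `cluster_labels` comes from the density-based clustering step of Sect. 3.3, with −1 marking noisy instances.

```python
import numpy as np

def sorted_distances(X_from, x_i):
    """Distances from x_i to every row of X_from, sorted ascending."""
    return np.sort(np.sqrt(((X_from - x_i) ** 2).sum(axis=1)))

def assign_reliability(X, y, cluster_labels, noise_value=1e-3):
    """Assign r_i per rules (a)-(f); y: +1 minority, -1 majority;
    cluster_labels: cluster index per instance, -1 for noisy instances."""
    ir_global = max(1, int(round((y == -1).sum() / float((y == 1).sum()))))
    r = np.empty(len(X))
    for i in range(len(X)):
        x_i, y_i, c_i = X[i], y[i], cluster_labels[i]
        dists_opposite = sorted_distances(X[y != y_i], x_i)
        if c_i == -1:                                   # noisy instance
            if y_i == 1:                                # (a) minority noise
                r[i] = dists_opposite[0]                #     nearest majority neighbor
            else:                                       # (b) majority noise
                r[i] = noise_value * np.random.rand()   #     small random positive value
            continue
        if y_i == -1:                                   # (d)/(f) majority instance
            r[i] = dists_opposite[0]                    #     nearest minority neighbor
            continue
        members = y[cluster_labels == c_i]              # minority instance: choose the IR
        if (members == -1).any():                       # (c) overlapping cluster -> IR_local
            ir = max(1, int(round((members == -1).sum() / float((members == 1).sum()))))
        else:                                           # (e) non-overlapping cluster -> IR_global
            ir = ir_global
        r[i] = dists_opposite[min(ir, len(dists_opposite)) - 1]  # IR-th nearest majority neighbor
    return r
```

After this step, the distance adjustment of Eq. (6) and the majority vote of Eq. (3) are applied as described below.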

In the next step, we adjust the distance metric by Eq. (6). The new distance metric depends on two factors, \(r_{i}\) and the Euclidean distance \(d\left( {x,x_{i} } \right)\): a training instance \(x_{i}\) is reliable when it is located near the query instance \(x\) and possesses a high \(r_{i}\) value. After the distance adjustment, reliable instances are pulled closer to the query instance and unreliable ones are pushed away.

Finally, the majority vote is performed under the new distance metric by Eq. (3). From the procedure introduced above, we can see that in the \(r_{i}\) assignment our method considers not only the local but also the global distribution of the dataset.

Figure 2 illustrates the process of DBANN. Stars and circles represent the majority and minority class, respectively, and hollow and solid shapes represent the original and the adjusted distribution (after distance adjustment). Here, we focus only on the changes of the instances marked with letters. First, consider points a and b, which lie in an overlapping cluster. The two instances are located close to each other, indicating a high overlapping degree; therefore, they are viewed as unreliable points and ought to be pushed away from the query instance by Eqs. (5) and (6). Nevertheless, besides the overlapping degree, the imbalance issue is also considered. As a result, only the majority point b is pushed to B, whereas point a is pulled to A, closer to the query instance, owing to the local imbalance ratio (\({\text{IR}}_{{{\text{local}}2}} = 2\)). A similar distance adjustment applies to points c and d; the only difference is that point c is pulled even closer to the query instance due to the higher local imbalance ratio (\({\text{IR}}_{{\rm local1}} = 7\)). Additionally, the majority noisy point e is pushed far away owing to its small \(r_{i}\) value, and the minority noisy point f is moved to F. As for the minority point g in a non-overlapping cluster, the global imbalance ratio (\({\text{IR}}_{{\rm global}} = 3\)) is used to pull it to G, and the majority point h is moved to H. Consequently, after the distance adjustment, the query neighbors are points A, C and G, and the query instance is predicted as minority. Details of DBANN are listed in Algorithm 1.

figure a

3.3 Density-based clustering algorithm

In Sect. 3.2, we take advantage of a density-based clustering method to divide the training data into different parts. Technically, DBANN is a framework, which implies that many existing density-based clustering algorithms can be used. In this paper, we choose DBSCAN.

DBSCAN is a typical density-based clustering algorithm which defines a cluster as a region of densely packed points separated by regions of lower density. It has attracted much attention owing to its desirable properties, including arbitrarily shaped clusters, automatic identification of the number of clusters and noise detection [55, 56]. Additionally, as a useful method for capturing data distribution, DBSCAN is often combined with other algorithms to enhance classification ability, especially for severely complex data distributions [38, 57]. DBSCAN characterizes density variation through two input parameters, a positive value \(eps\) and a positive integer \(Minpts\). On this basis, some definitions of DBSCAN are listed as follows:

Definition 1

The \(eps{\text{-}}neighborhood\) of a point \(p\) is the set of points within radius \(eps\) of \(p\).

Definition 2

A point \(p\) is a core point if the number of points in its \(eps{\text{-}}neighborhood\) is more than \(Minpts\).

Definition 3

A point \(p\) is directly density-reachable from a point \(q\) if \(q\) is a core point and \(p\) is in its \(eps{\text{-}}neighborhood\).

Definition 4

A point \(p\) is density-reachable from a point \(q\) if there is a chain of points \(p_{1} , \ldots ,p_{n}\), with \(p_{1} = q\) and \(p_{n} = p\), such that \(p_{i + 1}\) is directly density-reachable from \(p_{i}\).

Definition 5

A border point \(p\) is a point that lies in the \(eps{\text{-}}neighborhood\) of a core point \(q\) but has fewer than \(Minpts\) neighbors within its own \(eps\) radius.

Definition 6

A noisy point \(p\) is a point that is neither a core point nor a border point.

Initially, DBSCAN arbitrarily selects a point \(p\) and retrieves its \(eps{\text{-}}neighborhood\); this process is defined as QueryNeighbour. If the size of the \(eps{\text{-}}neighborhood\) is larger than \(Minpts\), point \(p\) is marked as a core point, forming a new cluster; otherwise, \(p\) is marked as a noisy point. Subsequently, the cluster expands by iteratively adding unvisited density-reachable points. The process is repeated until every point is marked either as belonging to a cluster or as a noisy point. Notably, even if a point is initially marked as a noisy point, it may be turned into a border point of another cluster during the cluster expansion process. Finally, DBSCAN yields several clusters and a set of noisy points. Figure 3 demonstrates the process of DBSCAN (\(Minpts = 4\); \(eps\) is indicated by the circles). As can be seen from the graph, point A is marked as a core point at the beginning and thus creates a new cluster. Afterward, the cluster expands based on the density measure; it absorbs all blue points until it reaches the yellow border points (F, G), which form the edge of the cluster. Owing to low density, points H and I are marked as noisy points. The details of DBSCAN are shown in Algorithm 2.

figure b
Fig. 3
figure 3

Process of DBSCAN
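Because DBANN only needs the cluster labels and the noise flags, any standard DBSCAN implementation can serve as this clustering step. A minimal sketch using scikit-learn's DBSCAN (our choice for illustration; the original experiments may use a different implementation) on toy data is:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = rng.rand(100, 2)                                  # toy training features, shape (N, m)

clustering = DBSCAN(eps=0.1, min_samples=4).fit(X)    # Minpts = 4; eps tuned as in Sect. 4.2
cluster_labels = clustering.labels_                   # one cluster index per instance, -1 = noise

n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = int(np.sum(cluster_labels == -1))
print("clusters: %d, noisy instances: %d" % (n_clusters, n_noise))
```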

4 Experiments

In this section, experiments are carried out on 16 synthetic datasets and 41 real-world datasets to validate the effectiveness of the DBANN method. The details of the datasets are described in Sect. 4.1. In Sect. 4.2, we list the comparative algorithms and the corresponding parameter settings. The evaluation metrics and statistical tests are introduced in Sect. 4.3. All experiments are implemented in Python (version 2.7.14).

4.1 Datasets used in the experiments

Experiments are conducted on both synthetic and real-world datasets. Specifically, the synthetic data for both classes are generated from bivariate normal distributions. On the one hand, to explore the classification performance under different overlapping degrees, the mean vector of the minority class is fixed at [0.00, 0.00] and the mean vector of the majority class is set to [0.05, 0.05], [0.50, 0.50], [1.30, 1.30] and [2.70, 2.70], respectively, representing four overlapping degrees. A mean vector of [0.05, 0.05] means the centers of the two classes are close, so a severe overlapping region exists; a mean vector of [2.70, 2.70] indicates that the two centers are far apart, so the overlapping degree is slight. On the other hand, the synthetic data are also generated with four different imbalance ratios by changing the numbers of instances in both classes. The description of the synthetic datasets is shown in Table 1. As for real-world applications, we select 41 datasets from the KEEL repository [58], following previous research [35], as shown in Table 2. KEEL is an open-source repository which provides benchmark datasets for assessing the behavior of algorithms in different scenarios [58]. The datasets are ordered by overlapping degree according to Fisher's discriminant ratio (F1), which divides them into two parts: low overlapping datasets with F1 ≥ 1.6 and high overlapping datasets with F1 < 1.6. The imbalance ratio (IR) of the datasets ranges from 1.8 to 68.1, as shown in Table 2.

Table 1 Introduction of 16 synthetic datasets
Table 2 Introduction of 41 real-world datasets

4.2 Algorithms and parameter settings

In our experiments, DBANN is compared with other algorithms and strategies. The comparative algorithms fall into three groups: (a) kNN-based methods: algorithms derived from k nearest neighbors and modified to address the imbalanced or overlapping problem, including W-kNN [45], kRNN [49], F-kNN [47], H-kNN [46] and the standard kNN classifier. (b) Generality-oriented learning algorithms and strategies: the CART decision tree and support vector machine (SVM), together with the data balancing methods SMOTE [59] and overlap-based under-sampling (OBU) [40], which are popular for handling imbalanced and overlapping problems. (c) Ensemble algorithms: kd-tree-based efficient ensemble (KDE) [60], hybrid sampling with bagging (HSB) [61] and RUSBoost (RUS) [62], representing effective bagging and boosting ensembles for the imbalanced problem.

  • For the kNN-based methods W-kNN, kNN, F-kNN, H-kNN, DBANN and kRNN, the parameter \(k\) is chosen from the original literature and is set to 3, 3, 3, 3, 3 and 1, respectively. The other parameters are set according to the original literature. For DBANN, \(Minpts\) is set to 4, and \(eps\) is chosen as the optimal value from the range [0.01, 200] by 10-fold cross-validation (see the sketch following this list).

  • For the generality-oriented algorithms, the support vector machine is implemented with a linear kernel, which shows desirable performance on the selected datasets. SMOTE and OBU are applied before classification so that the minority and majority classes have equal numbers of instances.

  • For the ensemble algorithms, the base classifier is a decision tree, and the number of base classifiers is {10, 10, 40} for KDE, HSB and RUS, respectively, according to the original literature. In KDE, k = 3 and \(\varepsilon = 0.1\); in HSB, k = 3 and I = {0, 0.2, 0.4, 0.6, 0.8, 1}.
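To illustrate the \(eps\) selection described in the first bullet, a simple grid search with 10-fold cross-validation could be organized as follows; `dbann_fit_predict` is a hypothetical helper standing in for a DBANN implementation and is not part of any public library.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def tune_eps(X, y, eps_grid, dbann_fit_predict, n_splits=10):
    """Return the eps value with the best mean F-measure over 10-fold cross-validation.

    dbann_fit_predict(X_tr, y_tr, X_te, eps) is assumed to train DBANN with the
    given eps (Minpts fixed at 4) and return predictions for X_te.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    mean_scores = []
    for eps in eps_grid:
        scores = [f1_score(y[te], dbann_fit_predict(X[tr], y[tr], X[te], eps), pos_label=1)
                  for tr, te in skf.split(X, y)]
        mean_scores.append(np.mean(scores))
    return eps_grid[int(np.argmax(mean_scores))]
```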

4.3 Performance measures and significance statistical test

All experiments employ 10-fold cross-validation. The confusion matrix in Table 3 shows the four types of classification results. On this basis, two indicators, the geometric mean (GM) and the F-measure (F1), are used to evaluate classification performance; their definitions are given in Eqs. (7)–(11). It can be seen from Eqs. (10) and (11) that GM considers the proportions of correctly classified instances in both the minority and majority classes, while F1 focuses more on the trade-off between precision and recall.

Table 3 Confusion matrix for binary classification

To evaluate whether significant differences exist among the experimental algorithms, statistical tests are necessary. Here we adopt the non-parametric Friedman test and the Bonferroni–Dunn post hoc test [63]. The Friedman test is first employed to detect differences among all the algorithms on the two indicators. After that, the Bonferroni–Dunn test is applied to check whether DBANN performs significantly better than the comparative algorithms.

$${\text{Recall}} = {\text{Sensitivity}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}$$
(7)
$${\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}$$
(8)
$${\text{Specificity}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}}$$
(9)
$$F1 = \frac{{2 \cdot {\text{Recall}} \cdot {\text{Precision}}}}{{{\text{Recall}} + {\text{Precision}}}}$$
(10)
$${\text{GM}} = \sqrt {{\text{Sensitivity}} \cdot {\text{Specificity}}}$$
(11)
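A compact sketch of Eqs. (7)–(11), assuming the minority (positive) class is coded as +1 and the majority class as −1:

```python
import numpy as np

def f1_and_gm(y_true, y_pred):
    """Compute the F-measure (Eq. 10) and geometric mean (Eq. 11) from the confusion matrix."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    recall = tp / float(tp + fn)                          # Eq. (7), sensitivity
    precision = tp / float(tp + fp)                       # Eq. (8)
    specificity = tn / float(tn + fp)                     # Eq. (9)
    f1 = 2 * recall * precision / (recall + precision)    # Eq. (10)
    gm = np.sqrt(recall * specificity)                    # Eq. (11)
    return f1, gm
```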

To implement the Friedman test, we first rank the performance of the \(K\) algorithms on each dataset; the best performance is ranked 1 and the worst is ranked \(K\). When ties appear, average ranks are assigned. Subsequently, we compute the Friedman statistic \(\chi_{F}^{2}\) by Eqs. (12) and (13), where \(r_{i}^{j}\) denotes the rank of the \(j\)th of \(K\) algorithms on the \(i\)th of \(N\) datasets, so that \(R_{j}\) is the average rank of the \(j\)th algorithm. Moreover, Iman and Davenport [64] found that Friedman's \(\chi_{F}^{2}\) is undesirably conservative and derived a better statistic \(F_{F}\), which follows an F-distribution with \(\left( {K - 1} \right)\) and \(\left( {K - 1} \right)\left( {N - 1} \right)\) degrees of freedom, as shown in Eq. (14). The critical value is \(q_{\beta } = F\left( {\alpha ,K - 1,\left( {K - 1} \right)\left( {N - 1} \right)} \right)\). When \(F_{F} > q_{\beta }\), the null hypothesis is rejected, i.e., significant differences exist among the comparative algorithms, and vice versa. Once the null hypothesis is rejected, the Bonferroni–Dunn post hoc test is applied to conduct pairwise comparisons between DBANN and the other algorithms. Here, the critical value \(q_{\gamma }\) is based on the studentized range statistic divided by \(\sqrt 2\) [63]. A significant difference exists when the average ranks of two algorithms differ by at least the critical difference (CD) in Eq. (15) [63].

$$R_{j} = \frac{1}{N}\sum\limits_{i} {r_{i}^{j} }$$
(12)
$$\chi_{F}^{2} = \frac{12N}{K(K + 1)}\left( {\sum\limits_{j} {R_{j}^{2} - \frac{{K(K + 1)^{2} }}{4}} } \right)$$
(13)
$$F_{F} = \frac{{(N - 1) \cdot \chi_{F}^{2} }}{{N(K - 1) - \chi_{F}^{2} }}$$
(14)
$${\text{CD}} = q_{\gamma } \sqrt {\frac{K(K + 1)}{6 \cdot N}}$$
(15)
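For completeness, Eqs. (12)–(15) can be evaluated directly from a rank matrix; the sketch below is a plain transcription of the formulas, not a replacement for a full statistical package.

```python
import numpy as np

def friedman_statistics(ranks):
    """ranks: (N, K) array, ranks[i, j] = rank of algorithm j on dataset i."""
    N, K = ranks.shape
    R = ranks.mean(axis=0)                                                       # Eq. (12)
    chi2 = 12.0 * N / (K * (K + 1)) * (np.sum(R ** 2) - K * (K + 1) ** 2 / 4.0)  # Eq. (13)
    f_f = (N - 1) * chi2 / (N * (K - 1) - chi2)                                  # Eq. (14)
    return chi2, f_f

def critical_difference(q_gamma, K, N):
    """Bonferroni-Dunn critical difference, Eq. (15); q_gamma taken from the table in [63]."""
    return q_gamma * np.sqrt(K * (K + 1) / (6.0 * N))
```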

5 Results and discussion

5.1 Analyzing the critical parameters and property of DBANN

In this section, we provide an insight into detailed properties of DBANN. We first discuss the influence of parameter \(eps\) and parameter \(k\) on classification performance. Afterward, we investigate the distribution of query neighbors in DBANN. Finally, we analyze the advantage of DBANN over other kNN-based methods.

5.1.1 \(eps\) value

\(eps\) and \(Minpts\) are the two input parameters of DBSCAN. Previous studies [55, 65] reported that \(Minpts\) has little impact on clustering results. Therefore, in this section \(Minpts\) is set to 4 and \(eps\) is varied from 0.01 to 200 to analyze its influence on classification performance. We choose five real-world datasets for these experiments, and the results are shown in Fig. 4. Clearly, \(eps\) is a sensitive parameter that dominates the performance of DBANN. In general, as \(eps\) increases, F1 and GM follow an increasing trend with fluctuations (phase I) and then stabilize within a fixed range (phase II). Notably, for some datasets (yeast1, glass016vs5, abalone21vs8) the optimal \(eps\) value lies in phase I, whereas for others (newthyroid2, winequalityred3vs5) it lies in phase II. In this study, grid search is used to determine the optimal \(eps\) value.

Fig. 4
figure 4

Performance of DBANN with different eps in F1 and GM

To reveal the root cause of the sensitivity to the \(eps\) value, we further analyze the relationship among \(eps\), the clustering situation and the classification performance. We demonstrate this on the glass016vs5 dataset, and the results are shown in Table 4. We find that \(eps\) directly affects the clustering situation. When \(eps < 0.1\), all instances are labeled as noisy instances and no clusters are formed. As \(eps\) increases, more noisy instances are absorbed into clusters; when \(eps = 0.3\), DBANN forms at most five clusters. Subsequently, the clusters expand and merge until finally one big cluster remains with no noisy instances. As a result, F1 and GM also vary with \(eps\), as shown in the last two columns of Table 4. In particular, when \(eps = 2\), DBANN achieves the optimal performance (F1 = 53.53%, GM = 72.96%) with two clusters and ten noisy instances.

Table 4 Relationship among eps, clustering situation and classification performance on glass016vs5

As stated in Sect. 3.2, the clustering results directly determine the choice of query neighbors. Therefore, \(eps\), which dominates the clustering situation, strongly affects the classification performance of DBANN.

5.1.2 \(k\) value

To analyze the influence of the \(k\) value, we fix \(eps\) at its optimal value and vary \(k = 1, 2, 3, \ldots, 60\) on five real-world datasets. The classification results in Fig. 5 show that on most datasets the performance drops as \(k\) increases; in particular, when \(k\) reaches 60, the performance drops to 0. This can be partly explained by the way DBANN handles imbalance: it does not generate additional synthetic instances to compensate for the minority class but instead increases the probability of minority instances being selected as query neighbors. However, when \(k\) is too large, the proportion of minority instances among the query neighbors cannot increase further even if the selection probability is 100%, owing to the imbalanced distribution, which results in performance loss. In our experience, \(k = 3\) gives desirable performance.

Fig. 5
figure 5

Performance of DBANN with different k in F1 and GM

5.1.3 Distribution of reliable query neighbors

In this section, we investigate the distribution of query neighbors through a series of experiments. We set \(k = 3\) and \(eps\) at its optimal value throughout. The ranking of each training instance is defined by sorting the distances from the training instances to a query instance in ascending order, i.e., the nearest instance is ranked 1.

We take the glass0 dataset as an example. We first run DBANN on glass0 with 10-fold cross-validation. For each fold, we record the rankings of the query neighbors of all query instances; after 10-fold cross-validation, we obtain the complete set of rankings. For example, suppose one fold has 90 training instances and 10 query instances. In each fold we record the rankings of the three query neighbors of each query instance, giving 30 rankings, so after 10-fold cross-validation there are 300 rankings in total for analysis. To facilitate observation, we calculate the proportion of rankings falling into different intervals (1st, 2nd–3rd, 4th–5th, 6th–7th, 8th–9th, 10%–20%, 20%–100%), and the result is shown in Fig. 6a. Obviously, unlike traditional kNN, the query neighbors of DBANN are no longer always the \(k\) nearest neighbors. In fact, the first three rankings (traditional kNN) account for only 13.83% of the query neighbors, and the largest proportion of query neighbors falls into the 10%–20% ranking interval over all training instances. Notably, instances located far away from the query instance (rankings in the 20%–100% interval) can also be selected as query neighbors, although the proportion is only 2.19%. This partly shows that DBANN considers not only the local but also the global distribution when selecting query neighbors.

Fig. 6
figure 6

Distribution of query neighbors in DBANN

To better understand the query neighbor selection mechanism, we extend the experiments to the 16 synthetic datasets (introduced in Sect. 4.1), which cover four overlapping levels and four imbalance levels. We define the \(k\) nearest neighbors of a query instance (the traditional kNN query neighbors) as local neighbors and then analyze the proportion of local neighbors among all query neighbors selected by DBANN. A higher proportion indicates that decision making depends more on the local distribution; conversely, a lower proportion indicates that decision making relies more on the global distribution.

From the results in Fig. 6b, we can see that when the overlapping degree is slight, DBANN relies more on local neighbors, with a proportion of approximately 0.85. In contrast, when the overlap is severe, the proportion of local neighbors is significantly lower on average. This partly shows the advantage of our query neighbor selection: when the data distribution is difficult, DBANN adaptively expands the detection radius to search for more reliable instances even if they are located far away. Besides, Fig. 6c compares different imbalance degrees; the graph clearly shows that for both slight and severe imbalance, the proportion of local neighbors drops as the overlapping degree increases. Moreover, on severely imbalanced datasets, the proportion of local neighbors under moderate and severe overlap is lower than on slightly imbalanced datasets. These observations imply that DBANN can adaptively shift the selection of query neighbors from the local to the global region in different scenarios.

5.1.4 Advantage of DBANN over other kNN-based methods

To further study the advantage of the query neighbor selection mechanism in DBANN, we compare DBANN with the other kNN-based algorithms on two typical datasets, glass1 and yeast2vs4, which have different overlapping degrees and imbalance ratios, as a case study.

To demonstrate this, we divide each dataset into an overlapping region and a non-overlapping region so as to take a closer look at the performance of each algorithm in different regions (whole region, overlapping region and non-overlapping region). Inspired by [66], we use kNN (k = 5) to separate the two regions: an instance is considered to be in the non-overlapping region if it and all of its 5 nearest neighbors belong to the same class; otherwise, it is considered to be in the overlapping region. We then calculate the imbalance ratio (IR) and Fisher's discriminant ratio (F1) in each region (Table 5). Finally, we run the 6 kNN-based algorithms on the two datasets and record their performance in each region (Table 6). Note that F1 in Table 5 indicates the overlapping degree, while F1 in Table 6 indicates the F-measure.
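A sketch of this region-splitting step, using scikit-learn's NearestNeighbors purely for illustration (labels again assumed to be +1 minority, −1 majority):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def non_overlapping_mask(X, y, k=5):
    """True where an instance and all of its k nearest neighbors share the same class."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each query returns itself
    _, idx = nn.kneighbors(X)
    return np.array([np.all(y[row[1:]] == y[i]) for i, row in enumerate(idx)])

def region_imbalance_ratio(y, mask):
    """IR = N_majority / N_minority inside the region selected by mask."""
    region = y[mask]
    return (region == -1).sum() / float(max((region == 1).sum(), 1))
```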

Table 5 Imbalanced ratio (IR) and Fisher’s discriminant (F1) ratio in different regions
Table 6 Performance of kNN-based methods in different regions

Observing Table 5, we note that the local distribution differs across regions: the imbalance ratio is approximately 1 in the overlapping region, while it is much higher in the non-overlapping region. The overlapping degree (F1) is also higher in the overlapping region than in the other regions. These results support the earlier conclusion that the distribution in the overlapping region is complex and hard to learn. This conclusion is also confirmed when we compare the performance of all kNN-based algorithms on the two datasets in different regions in Table 6, where the F1 and GM values in the overlapping region are significantly lower than in the non-overlapping region and in the whole region. However, it is worth noting that DBANN performs better than the other algorithms in the overlapping region on both datasets. On glass1, DBANN achieves the best results (F1: 72.50, GM: 45.20) in the overlapping region, although its overall result (F1: 76.23, GM: 79.72) only ranks fourth among all algorithms. Meanwhile, the performance of DBANN in the non-overlapping region does not drop significantly compared with the other algorithms. The same situation occurs on yeast2vs4. These results indicate that DBANN is able to excel in the overlapping region at the cost of a small loss in the non-overlapping region. We regard this property as the main advantage of DBANN over the remaining algorithms and believe it derives from the adaptive query neighbor selection mechanism, which is sensitive to variations of the local distribution.

5.2 Overall performance of DBANN

5.2.1 Performance on synthetic datasets

In this part, we validate the effectiveness of the proposed method through a series of experiments. Tables 7 and 8 show the comparison of DBANN with the kNN-based methods in F1 and GM on the 16 synthetic datasets. The best result on each dataset is highlighted in bold-face. DBANN performs better than the other methods on almost all datasets in terms of average rank in F1 and GM. In particular, when the data distribution is severely overlapping, i.e., on datasets A1–A4, DBANN obtains the best average rank in both F1 (1.50) and GM (2.25). This indicates the advantage of the query neighbor selection mechanism in the face of extremely tough data distributions. When the overlapping degree is moderate or slight (B1–B4, C1–C4), DBANN obtains the optimal results on all datasets except for GM on B1 and C1. As for the imbalance issue, DBANN performs better at high imbalance ratios (1:9, 1:19) with an average rank of 1 in GM, while its average rank on the low imbalance ratio datasets (1:2, 1:4) is 2.5. This demonstrates that DBANN is able to handle highly imbalanced distributions.

Table 7 Comparative results between DBANN and kNN-based methods in F1 on synthetic datasets
Table 8 Comparative results between DBANN and kNN-based methods in GM on synthetic datasets

Moreover, to analyze the statistical significance of the differences among the comparative methods, the Friedman test (FR) is carried out. According to the F-distribution, the critical value is \(q_{\beta } = F\left( {0.05,\;5,\;5 \times 15} \right) = 2.9013\). From the results in Table 9, we can see that \(F_{F} > q_{\beta }\) for both F1 and GM, which indicates that significant differences exist among the compared methods. Subsequently, pairwise comparisons are conducted by the Bonferroni–Dunn test. The critical value \(q_{\gamma }\) of the two-tailed Bonferroni–Dunn test (\(\alpha = 0.05\)) with 6 algorithms is 2.576 [63]. Algorithms that are significantly different from DBANN are highlighted in bold-face. Concretely, differences exist for W-kNN, kNN, F-kNN and H-kNN in F1, and for W-kNN, kNN and H-kNN in GM. Additionally, we notice that DBANN appears similar to kRNN with regard to F1 and GM in the statistical test. However, the targets and structures of the two algorithms are completely different: kRNN biases the posterior probability estimation toward the minority class based on the local distribution to handle the imbalanced problem, whereas DBANN aims to boost performance by searching for reliable query neighbors in both the local and global distributions while additionally considering the overlapping issue.

Table 9 Results of the Friedmen test and the Bonferroni–Dunn test among kNN-based methods on synthetic datasets (\({\text{CD}} = 1.7038,q_{\beta } = 2.9013\))

5.2.2 Performance on real-world datasets

In this section, we compare DBANN with the kNN-based methods as well as the generality-oriented methods on the 41 real-world datasets. Tables 10 and 11 show that the average ranks of DBANN in GM and F1 are 2.3902 (1st) and 1.9268 (1st), respectively, indicating that DBANN achieves better performance than the other kNN-based methods. To obtain a clearer insight into the behavior of DBANN, we analyze the results under different distributions by means of a statistical study. First, considering the overlapping issue (high overlapping degree: F1 < 1.6, low overlapping degree: F1 ≥ 1.6), we note that the average ranks of DBANN are 2.15 and 2.2 with respect to F1 and GM on the high overlapping datasets, while the average ranks on the low overlapping datasets are 2.61 and 2.68, respectively. These results support the ability of DBANN in the face of highly overlapping distributions. Moreover, when high overlap and high imbalance co-occur, i.e., IR > 20 and F1 < 1.6, DBANN still outperforms most of the other methods. In particular, on the datasets yeast1458vs7, yeast1289vs7, winequalityred3vs5, yeast2vs8 and yeast4, DBANN obtains the optimal results. This good behavior is due to the query neighbor selection mechanism of DBANN, which provides query neighbors with more reliable information when the minority class is scarce and the distribution is overlapping. Again, we apply the Friedman test (FR) and the Bonferroni–Dunn test (BD) (\(q_{\beta } = F\left( {0.05,\;5,\;5 \times 40} \right) = 2.45\), \(q_{\gamma } = 2.576\)) on the real-world datasets and find that, among the kNN-based methods, DBANN differs significantly from W-kNN and F-kNN in F1 and from W-kNN, kNN, F-kNN and H-kNN in GM, as shown in Table 12. As for the generality-oriented methods, DBANN also achieves superior performance, as listed in Tables 13 and 14. In particular, in terms of F1, DBANN obtains the smallest average rank of 2.2561, which is superior to the second-best rank of 3.4146 by a large margin. Likewise, the significance test results listed in Table 15 indicate that there are differences between DBANN and most of the algorithms, except for HSB and SVM + SMOTE in GM and KDE in F1.

Table 10 Comparative results between DBANN and kNN-based methods in F1 on real-world datasets
Table 11 Comparative results between DBANN and kNN based methods in GM on real-world datasets
Table 12 Results of the Friedmen test and the Bonferroni–Dunn test among kNN-based methods on real-world datasets (\({\text{CD}} = 1.0643{\kern 1pt} ,q_{\beta } = 2.45\))
Table 13 Comparative results between DBANN and generality-oriented methods in F1 on real-world datasets
Table 14 Comparisons of DBANN with generality-oriented methods in GM on real-world datasets
Table 15 Results of the Friedmen test and the Bonferroni–Dunn test among generality-oriented methods on real-world datasets (\({\text{CD}} = 1.4552{\kern 1pt} ,q_{\beta } = 2.25\))

6 Conclusions

In this study, we propose a novel method, DBANN, to deal with both the imbalanced and overlapping problems. The main idea of DBANN is to find the most reliable query neighbors using density-based methods. We first divide the training data into six parts with DBSCAN; then, in each part, we assign a reliability degree to the instances based on density, class imbalance and the overlapping situation. Afterward, we adjust the distance metric according to the reliability degree so that reliable instances are more likely to be selected as query neighbors. Finally, the output is determined by the reliable query neighbors.

Different from existing kNN-based methods, DBANN takes advantage of both local and global information in query neighbor selection. Additionally, the noise factor is considered in DBANN to boost classification performance. It is worth noting that the query neighbors in our method adapt to the data distribution. To validate the effectiveness of DBANN, we conduct experiments on both synthetic and real-world datasets. The results show that our method outperforms kNN-based methods as well as generality-oriented methods in terms of F1 and GM.

Further research is required to extend DBANN to multi-class classification problems. Moreover, we plan to implement other density-based clustering methods within the DBANN framework. It would also be interesting to set up a dedicated public dataset collection for comparing algorithms on overlapping problems.