1 Introduction

Imbalanced classification, which arises in many scenarios including industrial manufacturing [1, 2], financial management [3, 4], biomedical engineering [5] and information technology [6], is one of the critical issues in machine learning and data mining. An imbalanced dataset has a skewed class distribution, i.e., the instances of one class outnumber those of the other classes. Most standard classifiers are biased toward the majority class, which leads to a high misclassification rate on minority instances. The imbalanced problem is therefore commonly viewed as a major challenge to classification and has attracted great attention [7].

Existing solutions to the imbalanced problem can be roughly grouped into four categories: resampling techniques, algorithm modification methods, cost-sensitive learning approaches and ensemble learning methods.

  • Resampling techniques rebalance the training dataset, generating an approximately balanced class distribution that is suitable for standard classifiers [8, 9].

  • Algorithm modification methods try to adjust the structure of standard classifiers to diminish the effect caused by class imbalance [10].

  • Cost-sensitive learning approaches generally assign a higher misclassification cost to the minority class to compensate for the scarcity of minority data [11].

  • Ensemble learning methods were originally developed to enhance classification ability by combining multiple single classifiers; researchers have since modified ensemble algorithms to adapt them to the imbalanced problem, with promising results [12,13,14].

In addition to the imbalanced problem, overlapping between classes is recognized as another factor that degrades learning performance [15, 16]. Overlap appears when a region contains almost equal numbers of instances from different classes. This situation results in roughly equal prior probabilities for the classes and thus poses a strong handicap for classification. The overlapping problem is pervasive in many real-world applications such as fault diagnosis [17], character recognition [18], speech classification [19] and drug design [20]. In these scenarios, instances from different classes often have similar characteristics in the feature space. For example, in character recognition, the letters ‘O’ and ‘o’ and the numeral ‘0’ have almost identical shapes, which creates an overlapping region in the feature space and makes them hard to separate. Previous investigations have shown that overlap degrades classification performance even more severely than imbalance [21]. To clearly present the relationship between the two factors, a series of experiments has been conducted by varying the degrees of imbalance and overlap in the dataset. The conclusions state that learning algorithms can yield competitive performance when a dataset has a low overlapping degree combined with a high imbalance ratio, but they can hardly achieve desirable results when the overlapping degree is high, even if the imbalance ratio is low. This demonstrates that overlap is the main factor in classification degradation [21,22,23]. Furthermore, Denil and Trappenberg [15] took the size of the dataset into consideration, and their study revealed that in small datasets the learning process is hindered by imbalance and overlap individually, whereas when training data are sufficient, the two factors interact to jeopardize the learning performance.

Currently, most studies deal with the overlapping and imbalanced problems separately. In practical applications, however, the overlapping problem frequently occurs in imbalanced data, which poses an even greater challenge to classification. Although a few papers attempt to consider both factors as a whole [24, 25], the structures of the related algorithms are complex to implement. To fill this gap, we propose a density-based adaptive nearest neighbor method (DBANN) that deals with these two problems simultaneously with a simple structure. The main idea of DBANN is an adaptive distance adjustment strategy devoted to defining and exploiting reliable query neighbors. To the best of our knowledge, our approach is the first kNN-based method that aims to combat both the imbalanced and overlapping problems. The main contributions of our study can be summarized as follows:

  • We propose a density-based kNN method named DBANN which can handle imbalanced and overlapping problems simultaneously.

  • To enhance classification ability, we develop a distance adjustment strategy using density-based methods to adaptively find out the most reliable query neighbors.

  • To validate the effectiveness of DBANN, we compare it with other state-of-the-art methods on 16 synthetic datasets and 41 real-world datasets, respectively.

The remainder of this paper is organized as follows. Related works are described in Sect. 2. In Sect. 3, we introduce the proposed DBANN method. Section 4 presents extensive experiments on both synthetic and real-world datasets. Results and discussion are shown in Sect. 5. Finally, conclusions are presented in Sect. 6.

2 Related works

2.1 Overlapping problem in imbalanced datasets

In binary classification, imbalanced data refer to a distribution in which instances of one class outnumber those of the other, as shown in Fig. 1a. In this situation, the minority class is hard to recognize by standard classifiers, which prefer a good coverage of the majority class in order to achieve desirable global performance. However, in real applications, the minority class often contains the critical information we need, such as scrap parts among all products or patients among the whole population. Therefore, it is imperative to gain a deep insight into the intrinsic characteristics of imbalanced data. With this in mind, we note that imbalanced data do not hinder learning ability on their own but in combination with other factors, such as dataset size [21, 26], noise [27, 28], small disjuncts [29, 30] and data shift [31, 32].

Fig. 1
figure 1

Examples of imbalanced and overlapping distribution

Overlap occurs in a region where both classes co-exist and is viewed as one of the main obstacles to classification [15], as depicted in Fig. 1b. In such a region, the probability of each class is approximately equal, which gives rise to a high misclassification rate. In order to quantify the overlapping degree for an individual feature dimension, Ho and Basu [33] proposed a metric called the maximum Fisher’s discriminant ratio (F1), as shown in Eq. (1), where \(\mu_{1} ,\mu_{2} ,\sigma_{1} ,\sigma_{2}\) indicate the means and variances of the two classes, respectively. For a multidimensional dataset, the maximal \(f\) among all the features is defined as F1. Datasets with a low value of F1 have a high degree of overlap, and vice versa. To overcome the overlapping problem, some researchers change the data distribution before modeling. Tang [16] first transformed the original data into a more separable distribution through overlapping pattern extraction and rough set theory, and then applied the proposed DR-SVM to the transformed data. Batista et al. [34] used data cleaning techniques to cope with highly overlapping data and achieved desirable results. Similarly, in order to better prepare the data for classification, other pre-processing methods such as data selection and feature selection are proposed in [35, 36]. However, pre-processing methods may involve the risk of noise introduction or information loss. Xiong et al. [37] found that modeling the overlapping and non-overlapping regions separately is a promising scheme for solving the class overlapping problem. Following this line of thinking, Vorraboot et al. [38] first partitioned the training data into a non-overlapping region, a borderline region and an overlapping region; different techniques were then employed for the different regions, and finally the outputs of all techniques were combined. Nevertheless, that study is only suitable for two Gaussian classes with independent and identical distributions.

$$f = \frac{{\left( {\mu_{1} - \mu_{2} } \right)^{2} }}{{\sigma_{1}^{2} + \sigma_{2}^{2} }}$$
(1)
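To make Eq. (1) concrete, the following sketch (our own illustration, not code from [33]) computes \(f\) for every feature of a two-class dataset and returns the maximum as F1; the small constant added to the denominator is an assumption to guard against zero variance.

```python
import numpy as np

def max_fisher_discriminant_ratio(X_majority, X_minority):
    """Maximum Fisher's discriminant ratio (F1) for a binary dataset.

    X_majority, X_minority: arrays of shape (n_maj, m) and (n_min, m).
    A low F1 indicates a high degree of class overlap.
    """
    mu1, mu2 = X_majority.mean(axis=0), X_minority.mean(axis=0)
    var1, var2 = X_majority.var(axis=0), X_minority.var(axis=0)
    f = (mu1 - mu2) ** 2 / (var1 + var2 + 1e-12)   # Eq. (1), one value per feature
    return f.max()                                  # F1 = maximal f over all features
```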

In real-world applications, the overlapping and imbalanced problems frequently co-exist in a dataset, but this combination is often ignored in previous studies. Figure 1c marks two such overlapping regions with circles. Compared with Fig. 1b, the overlapping regions here exhibit comparatively high density from the perspective of the global distribution. This is partly attributed to the sparsity of the minority class, which in turn accentuates the compactness of the two overlapping regions. Additionally, the imbalance ratio differs between the overlapping region and the other regions. It is worth noting that for learning algorithms based on a divide-and-conquer strategy [39], the variation of class distribution across regions may threaten classification performance. Besides, the overlapping and imbalanced problems also influence algorithms that are sensitive to data density, such as k nearest neighbors.

To address the overlapping and imbalanced problems, Alejo et al. [25] developed a hybrid method which combines a modified back propagation (MBP) with a Gabriel graph editing technique (GGE); MBP copes with the imbalance issue, and GGE is responsible for the overlapping problem. Vuttipittayamongkol et al. [40] proposed an overlap-based under-sampling method (OBU). By eliminating majority instances from the overlapping region, OBU improves the visibility of minority instances. So far, however, such studies remain scarce.

2.2 kNN-based methods for dealing with overlapping or imbalanced datasets

kNN is a typical non-parametric approach that is widely applied in diverse domains due to its simple but powerful decision rule [41, 42]. However, when encountering an imbalanced class distribution, kNN tends to lose its ability to yield competitive results [43, 44]. To cope with this, kNN has been modified to incorporate immunity against the influence of imbalance. Concretely, Kriminger et al. [43] proposed a class conditional nearest neighbor distribution algorithm (CCNND). To mitigate the effect of the imbalanced distribution, for each class, CCNND counts the training instances that satisfy a specified distance condition within the k nearest neighbors of a query instance. Afterward, an empirical cumulative distribution function (CDF) is built and the probability of each class is computed. Dubey and Pudi [45] provided a weighting scheme (hereafter called W-kNN) to address the imbalance issue. W-kNN assigns a weight to each class based on the misclassification rate obtained by traditional kNN. Patel [46] developed a hybrid weighted strategy (hereafter called H-kNN). Its main advantage is the use of a dynamic k value, i.e., a small k for the minority class and a large k for the majority class; therefore, H-kNN better mines the information in an imbalanced distribution. On this basis, the same author took fuzzy rule-based classification into consideration and proposed an improved fuzzy k-nearest neighbor (hereafter called F-kNN) [47]. Based on fuzzy membership, the query instance knows in advance to what degree its neighbors belong to each class. Zhang and Li [48] presented a minority-biased nearest neighbor algorithm called PNN. In order to handle the inappropriate probability estimation for the minority class, PNN fixes the number of minority query neighbors. For example, m-PNN means that there must be m minority instances among the query neighbors; the number of query neighbors therefore changes dynamically to ensure enough instances for the probability estimation of both classes. The k rare-class nearest neighbor classification (kRNN) [49] boosts PNN by updating the dynamic query neighbor strategy. The new strategy reinforces the analysis of the distributions around query instances, so kRNN can handle not only inter-class imbalance but also within-class imbalance. Mullick et al. [50] proposed an adaptive learning kNN method called Ada-kNN. It uses a class-based global imbalance handling scheme (GIHS) to compensate for the scarcity of minority data. To assign a global weight to each class, GIHS considers both the ideal class probability (balanced distribution) and the actual class probability (imbalanced distribution).

The overlapping problem is another obstacle for learning algorithms, and kNN is no exception, as stated in Sect. 2.1. Garcia et al. [44] investigated the behavior of kNN when overlap exists in imbalanced data. The results reveal that when the imbalance ratio in the overlapping region equals the global imbalance ratio, i.e., the majority class dominates the overlapping region, the true positive rate (TPR) drops as the overlapping degree increases. Conversely, when the minority class becomes the most represented class in the overlapping region, the TPR increases instead. Additionally, they pointed out that the imbalance ratio in the overlapping region matters more than the size of the overlapping region and the global imbalance ratio. Wang et al. [51] proposed an extremely simple but well-performing algorithm called A-kNN. It aims to form reliable query neighbors for the final decision. To do so, A-kNN modifies the distance metric to move reliable instances closer to the query instance. Hence, even if the query instance is located in the overlapping region, which is viewed as an ambiguous and untrusted area, the query neighbors selected after distance adjustment are reliable. Although A-kNN is an effective solution to the overlapping problem, it cannot handle the imbalance issue.

Even though some efforts have been made to enhance the classification performance of kNN, several drawbacks remain to be addressed. First, previous kNN-based methods treat the imbalanced and overlapping problems separately, although the two factors usually co-exist in real-world applications. Second, some modified kNN methods choose pivot instances as query neighbors for decision making; however, the selection criterion they use considers either global or local information alone. When data density varies considerably among regions, especially when the imbalance ratio differs significantly among these regions, such a selection criterion can no longer work effectively. Finally, previous works ignore the influence of noisy instances, which can significantly jeopardize classification performance. In the following sections, we address the above concerns with a novel method.

3 Combating overlapping and imbalanced problems using density-based adaptive k nearest neighbor method (DBANN)

In this section, we aim to conquer the overlapping and imbalanced problems with a density-based strategy. For ease of discussion, we focus on the binary-class problem, although the method can be generalized to the multi-class case. Consider a given training dataset \(D\) with \(N\) instances, \(D = \left\{ {\left( {x_{1} ,y_{1} } \right),\left( {x_{2} ,y_{2} } \right), \ldots ,\left( {x_{N} ,y_{N} } \right)} \right\}\), where \(x_{i} \in X\) is a training instance in \(m\) dimensions and \(x\) is a query instance. Traditional kNN first determines the \(k\) nearest neighbors of the query instance based on the Euclidean distance in Eq. (2); then, the majority vote in Eq. (3) is used to classify \(x\). In addition, \({\text{IR}}\) in Eq. (4) indicates the imbalance ratio of a dataset, where \(N_{{\rm minority}}\) and \(N_{{\rm majority}}\) refer to the numbers of minority and majority instances, respectively. In subsequent sections, \({\text{IR}}_{{\rm global}}\) denotes the imbalance ratio of the whole dataset and \({\text{IR}}_{{\rm local}}\) the imbalance ratio in a specified region.

$$d\left( {x,x_{i} } \right) = \left( {\sum\limits_{j = 1}^{m} {\left| {x^{j} - x_{i}^{j} } \right|^{2} } } \right)^{1/2}$$
(2)
$$f\left( x \right) = {\text{sgn}} \left( {\sum\limits_{i = 1}^{k} {y_{i} } } \right)$$
(3)
$${\text{IR}} = \frac{{N_{{\rm majority}} }}{{N_{{\rm minority}} }}$$
(4)
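For reference, a minimal sketch of the baseline decision rule in Eqs. (2)–(4) is given below; it assumes class labels are coded as +1 for the minority class and −1 for the majority class (a convention we adopt for illustration only).

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Traditional kNN: Euclidean distance (Eq. 2) + sign of the label sum (Eq. 3)."""
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))   # Eq. (2)
    neighbors = np.argsort(d)[:k]                          # k nearest neighbors
    return np.sign(y_train[neighbors].sum())               # Eq. (3), majority vote

def imbalance_ratio(y_train):
    """IR = N_majority / N_minority (Eq. 4)."""
    return (y_train == -1).sum() / float((y_train == 1).sum())
```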

3.1 Description of reliable query neighbors

Previous studies have shown the significance of query neighbors in kNN classification [48, 52]. Inspired by this, the main idea of DBANN is to seek reliable query neighbors to be used for the majority vote. Different from traditional kNN, DBANN selects query neighbors depending not only on distance but also on the data characteristics, i.e., the imbalanced and overlapping distributions in local and global regions. Besides, we expect the query neighbors to change adaptively with different degrees of imbalance and overlap. Therefore, we first describe the characteristics of reliable query neighbors as follows:

  • The query neighbors should be instances located near the query instance so as to include representative information.

  • Due to the scarcity of minority data, the query neighbors should be biased toward the minority class.

  • Since instances in the overlapping region are hard to separate, these instances should be viewed as unreliable and their probability of being selected as query neighbors should be lowered.

  • In most cases, noisy instances are an obstacle for classification [28, 53, 54]. Hence, they should be avoided when selecting query neighbors.

  • Since kNN has proved to be sensitive to data complexity and class density [44], it is desirable to take density and class distribution into consideration.

3.2 A density-based adaptive k nearest neighbor method (DBANN)

Most previous research on kNN methods concentrates on addressing the imbalanced problem but overlooks the influence of overlap. The literature [51] proposes an adaptive kNN method (A-kNN) to deal with the overlapping problem. Different from traditional kNN, A-kNN modifies the distance metric according to the overlapping degree. First, for each training instance \(x_{i}\), A-kNN computes a reliability coefficient \(r_{i}\), defined as the distance from \(x_{i}\) to the nearest training instance \(x_{j}\) belonging to a different class, as listed in Eq. (5). A lower \(r_{i}\) value means that \(x_{i}\) and \(x_{j}\) are located closer to each other, which implies a high overlapping degree in this region; therefore, \(x_{i}\) is viewed as an unreliable instance, and vice versa. In other words, the \(r_{i}\) value measures the reliability of \(x_{i}\): a high \(r_{i}\) value implies that \(x_{i}\) is reliable and helpful in classification, whereas a low \(r_{i}\) value indicates that \(x_{i}\) is unreliable. After obtaining \(r_{i}\) for each training instance, A-kNN adjusts the distance metric by Eq. (6). Finally, the output is obtained by Eq. (3). Technically speaking, A-kNN is a local method for handling the overlapping problem; however, the imbalance issue is not considered.

$$r_{i} = \mathop {\min }\limits_{{j:y_{j} \ne y_{i} }} d\left( {x_{i} ,x_{j} } \right)$$
(5)
$$d_{{\rm new}} \left( {x,x_{i} } \right) = \frac{{d\left( {x,x_{i} } \right)}}{{r_{i} }}$$
(6)
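A brief sketch of this adjustment, written from Eqs. (5) and (6) rather than taken from the original A-kNN implementation, is:

```python
import numpy as np

def reliability_coefficients(X_train, y_train):
    """Eq. (5): r_i = distance from x_i to its nearest neighbor of the opposite class."""
    r = np.empty(len(X_train))
    for i in range(len(X_train)):
        opposite = X_train[y_train != y_train[i]]
        r[i] = np.sqrt(((opposite - X_train[i]) ** 2).sum(axis=1)).min()
    return r

def adjusted_distances(X_train, r, x_query):
    """Eq. (6): d_new(x, x_i) = d(x, x_i) / r_i; a large r_i pulls x_i closer to x."""
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    return d / r
```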

In this study, we extend the concept of \(r_{i}\) to handle both imbalanced and overlapping problems.

Concretely, in the first step, we partition the training data into several clusters and noisy instances using a density-based method (introduced in Sect. 3.3). To further characterize the clusters, we consider the overlapping issue and divide the clusters into two types, defined as follows.

Definition 1

An overlapping cluster is a cluster that contains both majority and minority instances.

Definition 2

A non-overlapping cluster is a cluster that contains only majority or only minority instances.

Therefore, after clustering, the training data can be divided into six parts: (a) minority noisy instances, (b) majority noisy instances, (c) majority instances in overlapping clusters, (d) minority instances in overlapping clusters, (e) majority instances in non-overlapping clusters and (f) minority instances in non-overlapping clusters. Figure 2 depicts the six parts in detail.

Fig. 2
figure 2

Description of DBANN method

Afterward, in the second step, we assign a reliability coefficient \(r_{i}\) to each training instance \(x_{i}\). Different from A-kNN, we capture the distribution variation and take the noise factor into consideration. Specifically, we assign \(r_{i}\) to the training instances in each part as follows (a code sketch of this assignment step is given after the list):

  (a) For minority noisy instances, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the nearest majority neighbor.

  (b) For majority noisy instances, \(r_{i}\) is assigned as a small random positive value.

  (c) For minority instances in an overlapping cluster, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the \({\text{IR}}_{{\rm local}}\)th nearest majority neighbor, where \({\text{IR}}_{{\rm local}}\) equals the imbalance ratio of the corresponding cluster and thus represents the local distribution. Obviously, a high imbalance ratio expands the detection radius of \(x_{i}\) and accordingly yields a larger \(r_{i}\). Notably, \(r_{i}\) also depends on the density of the cluster: a high-density cluster packs many instances into a small region, so even when the detection radius expands to the \({\text{IR}}_{{\rm local}}\)th nearest majority neighbor, \(r_{i}\) may increase by only a small amount compared with low-density clusters.

  (d) For majority instances in an overlapping cluster, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the nearest minority neighbor.

  (e) For minority instances in a non-overlapping cluster, \(r_{i}\) is assigned as the distance from \(x_{i}\) to the \({\text{IR}}_{{\rm global}}\)th nearest majority neighbor. Different from \({\text{IR}}_{{\rm local}}\) in the overlapping case, \({\text{IR}}_{{\rm global}}\) here is the global imbalance ratio.

  (f) For majority instances in a non-overlapping cluster, \(r_{i}\) is also assigned as the distance from \(x_{i}\) to the nearest minority neighbor.
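The following sketch shows one way rules (a)–(f) could be coded; it is our own reading of the procedure, not the authors' released implementation. It assumes labels are +1 (minority) and −1 (majority) and that `cluster_labels` comes from the density-based clustering step of Sect. 3.3, with −1 marking noisy instances.

```python
import numpy as np

def sorted_distances(X_from, x_i):
    """Distances from x_i to every row of X_from, sorted ascending."""
    return np.sort(np.sqrt(((X_from - x_i) ** 2).sum(axis=1)))

def assign_reliability(X, y, cluster_labels, noise_value=1e-3):
    """Assign r_i per rules (a)-(f); y: +1 minority, -1 majority;
    cluster_labels: cluster index per instance, -1 for noisy instances."""
    ir_global = max(1, int(round((y == -1).sum() / float((y == 1).sum()))))
    r = np.empty(len(X))
    for i in range(len(X)):
        x_i, y_i, c_i = X[i], y[i], cluster_labels[i]
        dists_opposite = sorted_distances(X[y != y_i], x_i)
        if c_i == -1:                                   # noisy instance
            if y_i == 1:                                # (a) minority noise
                r[i] = dists_opposite[0]                #     nearest majority neighbor
            else:                                       # (b) majority noise
                r[i] = noise_value * np.random.rand()   #     small random positive value
            continue
        if y_i == -1:                                   # (d)/(f) majority instance
            r[i] = dists_opposite[0]                    #     nearest minority neighbor
            continue
        members = y[cluster_labels == c_i]              # minority instance: choose the IR
        if (members == -1).any():                       # (c) overlapping cluster -> IR_local
            ir = max(1, int(round((members == -1).sum() / float((members == 1).sum()))))
        else:                                           # (e) non-overlapping cluster -> IR_global
            ir = ir_global
        r[i] = dists_opposite[min(ir, len(dists_opposite)) - 1]  # IR-th nearest majority neighbor
    return r
```

After this step, the distance adjustment of Eq. (6) and the majority vote of Eq. (3) are applied as described below.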

In the next step, we adjust the distance metric by Eq. (6). The new distance metric depends on two factors, \(r_{i}\) and the Euclidean distance \(d\left( {x,x_{i} } \right)\): a training instance \(x_{i}\) is reliable when it is located near the query instance \(x\) and possesses a high \(r_{i}\) value. After the distance adjustment, reliable instances are pulled closer to the query instance and unreliable ones are pushed away.

Finally, the majority vote is performed under the new distance metric by Eq. (3). From the procedure introduced above, we can see that in the \(r_{i}\) assignment our method considers not only the local but also the global distribution of the dataset.

Figure 2 illustrates the process of DBANN. Stars and circles represent the majority and minority class, respectively, and hollow and solid shapes represent the original and the adjusted distribution (after distance adjustment). Here, we focus only on the changes of the instances marked with letters. First, consider points a and b, which lie in an overlapping cluster. The two instances are located close to each other, indicating a high overlapping degree; therefore, they are viewed as unreliable points and ought to be pushed away from the query instance by Eqs. (5) and (6). Nevertheless, besides the overlapping degree, the imbalance issue is also considered. As a result, only the majority point b is pushed to B, whereas point a is pulled to A, closer to the query instance, owing to the local imbalance ratio (\({\text{IR}}_{{{\text{local}}2}} = 2\)). A similar distance adjustment applies to points c and d; the only difference is that point c is pulled even closer to the query instance due to the higher local imbalance ratio (\({\text{IR}}_{{\rm local1}} = 7\)). Additionally, the majority noisy point e is pushed far away owing to its small \(r_{i}\) value, and the minority noisy point f is moved to F. As for the minority point g in a non-overlapping cluster, the global imbalance ratio (\({\text{IR}}_{{\rm global}} = 3\)) is used to pull it to G, and the majority point h is moved to H. Consequently, after the distance adjustment, the query neighbors are points A, C and G, and the query instance is predicted as minority. Details of DBANN are listed in Algorithm 1.

figure a

3.3 Density-based clustering algorithm

In Sect. 3.2, we take advantage of a density-based clustering method to divide the training data into different parts. Technically, DBANN is a framework, which implies that many existing density-based clustering algorithms can be used. In this paper, we choose DBSCAN.

DBSCAN is a typical density-based clustering algorithm which defines a cluster as a region of densely packed points separated by regions of lower density. It has attracted much attention owing to its desirable properties, including arbitrarily shaped clusters, automatic identification of the number of clusters and noise detection [55, 56]. Additionally, as a useful method for capturing data distribution, DBSCAN is often combined with other algorithms to enhance classification ability, especially for severely complex data distributions [38, 57]. DBSCAN characterizes density variation through two input parameters, a positive value \(eps\) and a positive integer \(Minpts\). On this basis, some definitions of DBSCAN are listed as follows:

Definition 1

The \(eps{\text{-}}neighborhood\) of a point \(p\) is the set of points within radius \(eps\) of \(p\).

Definition 2

A point \(p\) is a core point if the number of points in its \(eps{\text{-}}neighborhood\) is more than \(Minpts\).

Definition 3

A point \(p\) is directly density-reachable from a point \(q\) if \(q\) is a core point and \(p\) is in its \(eps{\text{-}}neighborhood\).

Definition 4

A point \(p\) is density-reachable from a point \(q\) if there is a chain of points \(p_{1} , \ldots ,p_{n}\), with \(p_{1} = q\) and \(p_{n} = p\), such that \(p_{i + 1}\) is directly density-reachable from \(p_{i}\).

Definition 5

A border point \(p\) is a point that lies in the \(eps{\text{-}}neighborhood\) of a core point \(q\) but has fewer than \(Minpts\) neighbors within its own \(eps\) radius.

Definition 6

A noisy point \(p\) is a point that is neither a core point nor a border point.

Initially, DBSCAN arbitrarily selects a point \(p\) and retrieves its \(eps{\text{-}}neighborhood\); this process is defined as QueryNeighbour. If the size of the \(eps{\text{-}}neighborhood\) is larger than \(Minpts\), point \(p\) is marked as a core point, forming a new cluster; otherwise, \(p\) is marked as a noisy point. Subsequently, the cluster expands by iteratively adding unvisited density-reachable points. The process is repeated until every point is marked either as belonging to a cluster or as a noisy point. Notably, even if a point is initially marked as a noisy point, it may be turned into a border point of another cluster during the cluster expansion process. Finally, DBSCAN yields several clusters and a set of noisy points. Figure 3 demonstrates the process of DBSCAN (\(Minpts = 4\); \(eps\) is indicated by the circles). As can be seen from the graph, point A is marked as a core point at the beginning and thus creates a new cluster. Afterward, the cluster expands based on the density measure; it absorbs all blue points until it reaches the yellow border points (F, G), which form the edge of the cluster. Owing to low density, points H and I are marked as noisy points. The details of DBSCAN are shown in Algorithm 2.

figure b
Fig. 3
figure 3

Process of DBSCAN
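Because DBANN only needs the cluster labels and the noise flags, any standard DBSCAN implementation can serve as this clustering step. A minimal sketch using scikit-learn's DBSCAN (our choice for illustration; the original experiments may use a different implementation) on toy data is:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = rng.rand(100, 2)                                  # toy training features, shape (N, m)

clustering = DBSCAN(eps=0.1, min_samples=4).fit(X)    # Minpts = 4; eps tuned as in Sect. 4.2
cluster_labels = clustering.labels_                   # one cluster index per instance, -1 = noise

n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = int(np.sum(cluster_labels == -1))
print("clusters: %d, noisy instances: %d" % (n_clusters, n_noise))
```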

4 Experiments

In this section, experiments are carried out on 16 synthetic datasets and 41 real-world datasets to validate the effectiveness of the DBANN method. The details of the datasets are described in Sect. 4.1. In Sect. 4.2, we list the comparative algorithms and the corresponding parameter settings. The evaluation metrics and statistical tests are introduced in Sect. 4.3. All experiments are implemented in Python (version 2.7.14).

4.1 Datasets used in the experiments

Experiments are conducted on both synthetic and real-world datasets. Specifically, the synthetic data for both classes are generated from bivariate normal distributions. On the one hand, to explore the classification performance under different overlapping degrees, the mean vector of the minority class is fixed at [0.00, 0.00] and the mean vector of the majority class is set to [0.05, 0.05], [0.50, 0.50], [1.30, 1.30] and [2.70, 2.70], respectively, representing four overlapping degrees. A mean vector of [0.05, 0.05] means the centers of the two classes are close, so a severe overlapping region exists; a mean vector of [2.70, 2.70] indicates that the two centers are far apart, so the overlapping degree is slight. On the other hand, the synthetic data are also generated with four different imbalance ratios by changing the numbers of instances in both classes. The description of the synthetic datasets is shown in Table 1. As for real-world applications, we select 41 datasets from the KEEL repository [58], following previous research [35], as shown in Table 2. KEEL is an open-source repository which provides benchmark datasets for assessing the behavior of algorithms in different scenarios [58]. The datasets are ordered by overlapping degree according to Fisher's discriminant ratio (F1), which divides them into two parts: low overlapping datasets with F1 ≥ 1.6 and high overlapping datasets with F1 < 1.6. The imbalance ratio (IR) of the datasets ranges from 1.8 to 68.1, as shown in Table 2.

Table 1 Introduction of 16 synthetic datasets
Table 2 Introduction of 41 real-world datasets

4.2 Algorithms and parameter settings

In our experiments, DBANN is compared with other algorithms and strategies. The comparative algorithms fall into three groups: (a) kNN-based methods: algorithms derived from k nearest neighbors and modified to address the imbalanced or overlapping problem, including W-kNN [45], kRNN [49], F-kNN [47], H-kNN [46] and the standard kNN classifier. (b) Generality-oriented learning algorithms and strategies: the CART decision tree and support vector machine (SVM), together with the data balancing methods SMOTE [59] and overlap-based under-sampling (OBU) [40], which are popular for handling imbalanced and overlapping problems. (c) Ensemble algorithms: kd-tree-based efficient ensemble (KDE) [60], hybrid sampling with bagging (HSB) [61] and RUSBoost (RUS) [62], representing effective bagging and boosting ensembles for the imbalanced problem.

  • For the kNN-based methods W-kNN, kNN, F-kNN, H-kNN, DBANN and kRNN, the parameter \(k\) is chosen from the original literature and is set to 3, 3, 3, 3, 3 and 1, respectively. The other parameters are set according to the original literature. For DBANN, \(Minpts\) is set to 4, and \(eps\) is chosen as the optimal value from the range [0.01, 200] by 10-fold cross-validation (see the sketch following this list).

  • For the generality-oriented algorithms, the support vector machine is implemented with a linear kernel, which shows desirable performance on the selected datasets. SMOTE and OBU are applied before classification so that the minority and majority classes have equal numbers of instances.

  • For the ensemble algorithms, the base classifier is a decision tree, and the number of base classifiers is {10, 10, 40} for KDE, HSB and RUS, respectively, according to the original literature. In KDE, k = 3 and \(\varepsilon = 0.1\); in HSB, k = 3 and I = {0, 0.2, 0.4, 0.6, 0.8, 1}.
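To illustrate the \(eps\) selection described in the first bullet, a simple grid search with 10-fold cross-validation could be organized as follows; `dbann_fit_predict` is a hypothetical helper standing in for a DBANN implementation and is not part of any public library.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def tune_eps(X, y, eps_grid, dbann_fit_predict, n_splits=10):
    """Return the eps value with the best mean F-measure over 10-fold cross-validation.

    dbann_fit_predict(X_tr, y_tr, X_te, eps) is assumed to train DBANN with the
    given eps (Minpts fixed at 4) and return predictions for X_te.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    mean_scores = []
    for eps in eps_grid:
        scores = [f1_score(y[te], dbann_fit_predict(X[tr], y[tr], X[te], eps), pos_label=1)
                  for tr, te in skf.split(X, y)]
        mean_scores.append(np.mean(scores))
    return eps_grid[int(np.argmax(mean_scores))]
```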

4.3 Performance measures and significance statistical test

All experiments employ 10-fold cross-validation. The confusion matrix in Table 3 shows the four types of classification results. On this basis, two indicators, the geometric mean (GM) and the F-measure (F1), are used to evaluate classification performance; their definitions are given in Eqs. (7)–(11). It can be seen from Eqs. (10) and (11) that GM considers the proportions of correctly classified instances in both the minority and majority classes, while F1 focuses more on the trade-off between precision and recall.

Table 3 Confusion matrix for binary classification

To evaluate whether significant differences exist among the experimental algorithms, statistical tests are necessary. Here we adopt the non-parametric Friedman test and the Bonferroni–Dunn post hoc test [63]. The Friedman test is first employed to detect differences among all the algorithms on the two indicators. After that, the Bonferroni–Dunn test is applied to check whether DBANN performs significantly better than the comparative algorithms.

$${\text{Recall}} = {\text{Sensitivity}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}$$
(7)
$${\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}$$
(8)
$${\text{Specificity}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}}$$
(9)
$$F1 = \frac{{2 \cdot {\text{Recall}} \cdot {\text{Precision}}}}{{{\text{Recall}} + {\text{Precision}}}}$$
(10)
$${\text{GM}} = \sqrt {{\text{Sensitivity}} \cdot {\text{Specificity}}}$$
(11)
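A compact sketch of Eqs. (7)–(11), assuming the minority (positive) class is coded as +1 and the majority class as −1:

```python
import numpy as np

def f1_and_gm(y_true, y_pred):
    """Compute the F-measure (Eq. 10) and geometric mean (Eq. 11) from the confusion matrix."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    recall = tp / float(tp + fn)                          # Eq. (7), sensitivity
    precision = tp / float(tp + fp)                       # Eq. (8)
    specificity = tn / float(tn + fp)                     # Eq. (9)
    f1 = 2 * recall * precision / (recall + precision)    # Eq. (10)
    gm = np.sqrt(recall * specificity)                    # Eq. (11)
    return f1, gm
```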

To implement the Friedman test, we first rank the performance of the \(K\) algorithms on each dataset; the best performance is ranked 1 and the worst is ranked \(K\). When ties appear, average ranks are assigned. Subsequently, we compute the Friedman statistic \(\chi_{F}^{2}\) by Eqs. (12) and (13), where \(r_{i}^{j}\) denotes the rank of the \(j\)th of \(K\) algorithms on the \(i\)th of \(N\) datasets, so that \(R_{j}\) is the average rank of the \(j\)th algorithm. Moreover, Iman and Davenport [64] found that Friedman's \(\chi_{F}^{2}\) is undesirably conservative and derived a better statistic \(F_{F}\), which follows an F-distribution with \(\left( {K - 1} \right)\) and \(\left( {K - 1} \right)\left( {N - 1} \right)\) degrees of freedom, as shown in Eq. (14). The critical value is \(q_{\beta } = F\left( {\alpha ,K - 1,\left( {K - 1} \right)\left( {N - 1} \right)} \right)\). When \(F_{F} > q_{\beta }\), the null hypothesis is rejected, i.e., significant differences exist among the comparative algorithms, and vice versa. Once the null hypothesis is rejected, the Bonferroni–Dunn post hoc test is applied to conduct pairwise comparisons between DBANN and the other algorithms. Here, the critical value \(q_{\gamma }\) is based on the studentized range statistic divided by \(\sqrt 2\) [63]. A significant difference exists when the average ranks of two algorithms differ by at least the critical difference (CD) in Eq. (15) [63].

$$R_{j} = \frac{1}{N}\sum\limits_{i} {r_{i}^{j} }$$
(12)
$$\chi_{F}^{2} = \frac{12N}{K(K + 1)}\left( {\sum\limits_{j} {R_{j}^{2} - \frac{{K(K + 1)^{2} }}{4}} } \right)$$
(13)
$$F_{F} = \frac{{(N - 1) \cdot \chi_{F}^{2} }}{{N(K - 1) - \chi_{F}^{2} }}$$
(14)
$${\text{CD}} = q_{\gamma } \sqrt {\frac{K(K + 1)}{6 \cdot N}}$$
(15)
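For completeness, Eqs. (12)–(15) can be evaluated directly from a rank matrix; the sketch below is a plain transcription of the formulas, not a replacement for a full statistical package.

```python
import numpy as np

def friedman_statistics(ranks):
    """ranks: (N, K) array, ranks[i, j] = rank of algorithm j on dataset i."""
    N, K = ranks.shape
    R = ranks.mean(axis=0)                                                       # Eq. (12)
    chi2 = 12.0 * N / (K * (K + 1)) * (np.sum(R ** 2) - K * (K + 1) ** 2 / 4.0)  # Eq. (13)
    f_f = (N - 1) * chi2 / (N * (K - 1) - chi2)                                  # Eq. (14)
    return chi2, f_f

def critical_difference(q_gamma, K, N):
    """Bonferroni-Dunn critical difference, Eq. (15); q_gamma taken from the table in [63]."""
    return q_gamma * np.sqrt(K * (K + 1) / (6.0 * N))
```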

5 Results and discussion

5.1 Analyzing the critical parameters and property of DBANN

In this section, we provide an insight into detailed properties of DBANN. We first discuss the influence of parameter \(eps\) and parameter \(k\) on classification performance. Afterward, we investigate the distribution of query neighbors in DBANN. Finally, we analyze the advantage of DBANN over other kNN-based methods.

5.1.1 \(eps\) value

\(eps\) and \(Minpts\) are the two input parameters of DBSCAN. Previous studies [55, 65] reported that \(Minpts\) has little impact on clustering results. Therefore, in this section \(Minpts\) is set to 4 and \(eps\) is varied from 0.01 to 200 to analyze its influence on classification performance. We choose five real-world datasets for these experiments, and the results are shown in Fig. 4. Clearly, \(eps\) is a sensitive parameter that dominates the performance of DBANN. In general, as \(eps\) increases, F1 and GM follow an increasing trend with fluctuations (phase I) and then stabilize within a fixed range (phase II). Notably, for some datasets (yeast1, glass016vs5, abalone21vs8) the optimal \(eps\) value lies in phase I, whereas for others (newthyroid2, winequalityred3vs5) it lies in phase II. In this study, grid search is used to determine the optimal \(eps\) value.

Fig. 4
figure 4

Performance of DBANN with different eps in F1 and GM

To reveal the root cause of the sensitivity to the \(eps\) value, we further analyze the relationship among \(eps\), the clustering situation and the classification performance. We demonstrate this on the glass016vs5 dataset, and the results are shown in Table 4. We find that \(eps\) directly affects the clustering situation. When \(eps < 0.1\), all instances are labeled as noisy instances and no clusters are formed. As \(eps\) increases, more noisy instances are absorbed into clusters; when \(eps = 0.3\), DBANN forms at most five clusters. Subsequently, the clusters expand and merge until finally one big cluster remains with no noisy instances. As a result, F1 and GM also vary with \(eps\), as shown in the last two columns of Table 4. In particular, when \(eps = 2\), DBANN achieves the optimal performance (F1 = 53.53%, GM = 72.96%) with two clusters and ten noisy instances.

Table 4 Relationship among eps, clustering situation and classification performance on glass016vs5

As stated in Sect. 3.2, the clustering results directly determine the choice of query neighbors. Therefore, \(eps\), which dominates the clustering situation, strongly affects the classification performance of DBANN.

5.1.2 \(k\) value

To analyze the influence of the \(k\) value, we fix \(eps\) at its optimal value and vary \(k = 1, 2, 3, \ldots, 60\) on five real-world datasets. The classification results in Fig. 5 show that on most datasets the performance drops as \(k\) increases; in particular, when \(k\) reaches 60, the performance drops to 0. This can be partly explained by the way DBANN handles imbalance: it does not generate additional synthetic instances to compensate for the minority class but instead increases the probability of minority instances being selected as query neighbors. However, when \(k\) is too large, the proportion of minority instances among the query neighbors cannot increase further even if the selection probability is 100%, owing to the imbalanced distribution, which results in performance loss. In our experience, \(k = 3\) gives desirable performance.

Fig. 5
figure 5

Performance of DBANN with different k in F1 and GM

5.1.3 Distribution of reliable query neighbors

In this section, we investigate the distribution of query neighbors through a series of experiments. We set \(k = 3\) and \(eps\) at its optimal value throughout. The ranking of each training instance is defined by sorting the distances from the training instances to a query instance in ascending order, i.e., the nearest instance is ranked 1.

We take the glass0 dataset as an example. We first run DBANN on glass0 with 10-fold cross-validation. For each fold, we record the rankings of the query neighbors of all query instances; after 10-fold cross-validation, we obtain the complete set of rankings. For example, suppose one fold has 90 training instances and 10 query instances. In each fold we record the rankings of the three query neighbors of each query instance, giving 30 rankings, so after 10-fold cross-validation there are 300 rankings in total for analysis. To facilitate observation, we calculate the proportion of rankings falling into different intervals (1st, 2nd–3rd, 4th–5th, 6th–7th, 8th–9th, 10%–20%, 20%–100%), and the result is shown in Fig. 6a. Obviously, unlike traditional kNN, the query neighbors of DBANN are no longer always the \(k\) nearest neighbors. In fact, the first three rankings (traditional kNN) account for only 13.83% of the query neighbors, and the largest proportion of query neighbors falls into the 10%–20% ranking interval over all training instances. Notably, instances located far away from the query instance (rankings in the 20%–100% interval) can also be selected as query neighbors, although the proportion is only 2.19%. This partly shows that DBANN considers not only the local but also the global distribution when selecting query neighbors.

Fig. 6
figure 6

Distribution of query neighbors in DBANN

To better understand the query neighbor selection mechanism, we extend the experiments to the 16 synthetic datasets (introduced in Sect. 4.1), which cover four overlapping levels and four imbalance levels. We define the \(k\) nearest neighbors of a query instance (the traditional kNN query neighbors) as local neighbors and then analyze the proportion of local neighbors among all query neighbors selected by DBANN. A higher proportion indicates that decision making depends more on the local distribution; conversely, a lower proportion indicates that decision making relies more on the global distribution.

From the results in Fig. 6b, we can see that when the overlapping degree is slight, DBANN relies more on local neighbors, with a proportion of approximately 0.85. In contrast, when the overlap is severe, the proportion of local neighbors is significantly lower on average. This partly shows the advantage of our query neighbor selection: when the data distribution is difficult, DBANN adaptively expands the detection radius to search for more reliable instances even if they are located far away. Besides, Fig. 6c compares different imbalance degrees; the graph clearly shows that for both slight and severe imbalance, the proportion of local neighbors drops as the overlapping degree increases. Moreover, on severely imbalanced datasets, the proportion of local neighbors under moderate and severe overlap is lower than on slightly imbalanced datasets. These observations imply that DBANN can adaptively shift the selection of query neighbors from the local to the global region in different scenarios.

5.1.4 Advantage of DBANN over other kNN-based methods

To further study the advantage of the query neighbor selection mechanism in DBANN, we compare DBANN with the other kNN-based algorithms on two typical datasets, glass1 and yeast2vs4, which have different overlapping degrees and imbalance ratios, as a case study.

To demonstrate this, we divide each dataset into an overlapping region and a non-overlapping region so as to take a closer look at the performance of each algorithm in different regions (whole region, overlapping region and non-overlapping region). Inspired by [66], we use kNN (k = 5) to separate the two regions: an instance is considered to be in the non-overlapping region if it and all of its 5 nearest neighbors belong to the same class; otherwise, it is considered to be in the overlapping region. We then calculate the imbalance ratio (IR) and Fisher's discriminant ratio (F1) in each region (Table 5). Finally, we run the 6 kNN-based algorithms on the two datasets and record their performance in each region (Table 6). Note that F1 in Table 5 indicates the overlapping degree, while F1 in Table 6 indicates the F-measure.
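A sketch of this region-splitting step, using scikit-learn's NearestNeighbors purely for illustration (labels again assumed to be +1 minority, −1 majority):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def non_overlapping_mask(X, y, k=5):
    """True where an instance and all of its k nearest neighbors share the same class."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each query returns itself
    _, idx = nn.kneighbors(X)
    return np.array([np.all(y[row[1:]] == y[i]) for i, row in enumerate(idx)])

def region_imbalance_ratio(y, mask):
    """IR = N_majority / N_minority inside the region selected by mask."""
    region = y[mask]
    return (region == -1).sum() / float(max((region == 1).sum(), 1))
```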

Table 5 Imbalanced ratio (IR) and Fisher’s discriminant (F1) ratio in different regions
Table 6 Performance of kNN-based methods in different regions

Observing Table 5, we note that the local distribution differs across regions: the imbalance ratio is approximately 1 in the overlapping region, while it is much higher in the non-overlapping region. The overlapping degree (F1) is also higher in the overlapping region than in the other regions. These results support the earlier conclusion that the distribution in the overlapping region is complex and hard to learn. This conclusion is also confirmed when we compare the performance of all kNN-based algorithms on the two datasets in different regions in Table 6, where the F1 and GM values in the overlapping region are significantly lower than in the non-overlapping region and in the whole region. However, it is worth noting that DBANN performs better than the other algorithms in the overlapping region on both datasets. On glass1, DBANN achieves the best results (F1: 72.50, GM: 45.20) in the overlapping region, although its overall result (F1: 76.23, GM: 79.72) only ranks fourth among all algorithms. Meanwhile, the performance of DBANN in the non-overlapping region does not drop significantly compared with the other algorithms. The same situation occurs on yeast2vs4. These results indicate that DBANN is able to excel in the overlapping region at the cost of a small loss in the non-overlapping region. We regard this property as the main advantage of DBANN over the remaining algorithms and believe it derives from the adaptive query neighbor selection mechanism, which is sensitive to variations of the local distribution.

5.2 Overall performance of DBANN

5.2.1 Performance on synthetic datasets

In this part, we validate the effectiveness of the proposed method through a series of experiments. Tables 7 and 8 show the comparison of DBANN with the kNN-based methods in F1 and GM on the 16 synthetic datasets. The best result on each dataset is highlighted in bold-face. DBANN performs better than the other methods on almost all datasets in terms of average rank in F1 and GM. In particular, when the data distribution is severely overlapping, i.e., on datasets A1–A4, DBANN obtains the best average rank in both F1 (1.50) and GM (2.25). This indicates the advantage of the query neighbor selection mechanism in the face of extremely tough data distributions. When the overlapping degree is moderate or slight (B1–B4, C1–C4), DBANN obtains the optimal results on all datasets except for GM on B1 and C1. As for the imbalance issue, DBANN performs better at high imbalance ratios (1:9, 1:19) with an average rank of 1 in GM, while its average rank on the low imbalance ratio datasets (1:2, 1:4) is 2.5. This demonstrates that DBANN is able to handle highly imbalanced distributions.

Table 7 Comparative results between DBANN and kNN-based methods in F1 on synthetic datasets
Table 8 Comparative results between DBANN and kNN-based methods in GM on synthetic datasets

Moreover, to analyze the statistical significance of the differences among the comparative methods, the Friedman test (FR) is carried out. According to the F-distribution, the critical value is \(q_{\beta } = F\left( {0.05,\;5,\;5 \times 15} \right) = 2.9013\). From the results in Table 9, we can see that \(F_{F} > q_{\beta }\) for both F1 and GM, which indicates that significant differences exist among the compared methods. Subsequently, pairwise comparisons are conducted by the Bonferroni–Dunn test. The critical value \(q_{\gamma }\) of the two-tailed Bonferroni–Dunn test (\(\alpha = 0.05\)) with 6 algorithms is 2.576 [63]. Algorithms that are significantly different from DBANN are highlighted in bold-face. Concretely, differences exist for W-kNN, kNN, F-kNN and H-kNN in F1, and for W-kNN, kNN and H-kNN in GM. Additionally, we notice that DBANN appears similar to kRNN with regard to F1 and GM in the statistical test. However, the targets and structures of the two algorithms are completely different: kRNN biases the posterior probability estimation toward the minority class based on the local distribution to handle the imbalanced problem, whereas DBANN aims to boost performance by searching for reliable query neighbors in both the local and global distributions while additionally considering the overlapping issue.

Table 9 Results of the Friedmen test and the Bonferroni–Dunn test among kNN-based methods on synthetic datasets (\({\text{CD}} = 1.7038,q_{\beta } = 2.9013\))

5.2.2 Performance on real-world datasets

In this section, we compare DBANN with the kNN-based methods as well as the generality-oriented methods on the 41 real-world datasets. Tables 10 and 11 show that the average ranks of DBANN in GM and F1 are 2.3902 (1st) and 1.9268 (1st), respectively, indicating that DBANN achieves better performance than the other kNN-based methods. To obtain a clearer insight into the behavior of DBANN, we analyze the results under different distributions by means of a statistical study. First, considering the overlapping issue (high overlapping degree: F1 < 1.6, low overlapping degree: F1 ≥ 1.6), we note that the average ranks of DBANN are 2.15 and 2.2 with respect to F1 and GM on the high overlapping datasets, while the average ranks on the low overlapping datasets are 2.61 and 2.68, respectively. These results support the ability of DBANN in the face of highly overlapping distributions. Moreover, when high overlap and high imbalance co-occur, i.e., IR > 20 and F1 < 1.6, DBANN still outperforms most of the other methods. In particular, on the datasets yeast1458vs7, yeast1289vs7, winequalityred3vs5, yeast2vs8 and yeast4, DBANN obtains the optimal results. This good behavior is due to the query neighbor selection mechanism of DBANN, which provides query neighbors with more reliable information when the minority class is scarce and the distribution is overlapping. Again, we apply the Friedman test (FR) and the Bonferroni–Dunn test (BD) (\(q_{\beta } = F\left( {0.05,\;5,\;5 \times 40} \right) = 2.45\), \(q_{\gamma } = 2.576\)) on the real-world datasets and find that, among the kNN-based methods, DBANN differs significantly from W-kNN and F-kNN in F1 and from W-kNN, kNN, F-kNN and H-kNN in GM, as shown in Table 12. As for the generality-oriented methods, DBANN also achieves superior performance, as listed in Tables 13 and 14. In particular, in terms of F1, DBANN obtains the smallest average rank of 2.2561, which is superior to the second-best rank of 3.4146 by a large margin. Likewise, the significance test results listed in Table 15 indicate that there are differences between DBANN and most of the algorithms, except for HSB and SVM + SMOTE in GM and KDE in F1.

Table 10 Comparative results between DBANN and kNN-based methods in F1 on real-world datasets
Table 11 Comparative results between DBANN and kNN based methods in GM on real-world datasets
Table 12 Results of the Friedmen test and the Bonferroni–Dunn test among kNN-based methods on real-world datasets (\({\text{CD}} = 1.0643{\kern 1pt} ,q_{\beta } = 2.45\))
Table 13 Comparative results between DBANN and generality-oriented methods in F1 on real-world datasets
Table 14 Comparisons of DBANN with generality-oriented methods in GM on real-world datasets
Table 15 Results of the Friedmen test and the Bonferroni–Dunn test among generality-oriented methods on real-world datasets (\({\text{CD}} = 1.4552{\kern 1pt} ,q_{\beta } = 2.25\))

6 Conclusions

In this study, we propose a novel method, DBANN, to deal with both the imbalanced and overlapping problems. The main idea of DBANN is to find the most reliable query neighbors using density-based methods. We first divide the training data into six parts with DBSCAN; then, in each part, we assign a reliability degree to the instances based on density, class imbalance and the overlapping situation. Afterward, we adjust the distance metric according to the reliability degree so that reliable instances are more likely to be selected as query neighbors. Finally, the output is determined by the reliable query neighbors.

Different from existing kNN-based methods, DBANN takes advantage of both local and global information in query neighbor selection. Additionally, the noise factor is considered in DBANN to boost classification performance. It is worth noting that the query neighbors in our method adapt to the data distribution. To validate the effectiveness of DBANN, we conduct experiments on both synthetic and real-world datasets. The results show that our method outperforms kNN-based methods as well as generality-oriented methods in terms of F1 and GM.

Further research is required to extend DBANN to multi-class classification problems. Moreover, we plan to implement other density-based clustering methods within the DBANN framework. It would also be interesting to set up a dedicated public dataset collection for comparing algorithms on overlapping problems.