1 Introduction

Artificial neural networks (ANNs) are among the most widely used machine learning techniques for challenging real-life problems. An ANN is built from nodes (artificial neurons) connected in a layer-wise structure. The neurons in each layer receive signals from the previous layer, process them with a non-linear function, and pass the result to the neurons of the subsequent layer through the edges. A common approach is to learn the weights of the network iteratively by minimizing the sum of squared errors (SSE) between the output of the network and the actual target. In this way, the ANN approximates a function that fits the data under consideration. Although this computation of the weights works well for balanced data, it becomes biased when the data set is imbalanced. A data set is called imbalanced if the number of observations of one class (the majority class) exceeds the number of observations of the other class (the minority class). Imbalanced data arise in many real-world applications, including customer churn prediction (Burez and Van den Poel 2009), financial distress prediction (Cleofas-Sánchez et al. 2016), gene regulatory network reconstruction (Ceci et al. 2015) and information retrieval and filtering (Piras and Giacinto 2012). In most of these applications, the minority class is of prime interest. Standard classifiers generally misclassify the minority class more often because they are biased towards the majority class, which results in sub-optimal solutions (López et al. 2013; Thanathamathee and Lursinsap 2013).

In the literature, many solutions have been proposed for improving the classification accuracy of the minority class without severely jeopardizing the classification accuracy of the majority class. These solutions have been categorized as sampling methods (Barua et al. 2012; Chawla et al. 2002; Han et al. 2005; Shahee and Ananthakumar 2018a, b), cost-sensitive learning methods (Sun et al. 2007), one-class learning (Chawla et al. 2004), and feature selection (Alibeigi et al. 2012; Yin et al. 2013; Shahee and Ananthakumar 2019).

Sampling is a preprocessing technique that provides a balanced class distribution so that classifiers behave as they would on balanced data (Batista et al. 2004; Estabrooks et al. 2004). Cost-sensitive learning handles class imbalance by assigning a different misclassification cost to each example through a cost matrix (Elkan 2001; Ting 2002). It assigns a higher cost to the misclassification of minority class observations than to that of majority class observations. In one-class learning, a one-class SVM is trained with only the target class (Tax and Duin 2004). Feature selection in the imbalanced domain aims to select the features that are appropriate for better classification of the minority class. However, none of the above-mentioned approaches provides uniformly superior performance when the classes are imbalanced. Sampling is considered to be the simplest technique for handling class imbalance because it provides balanced class distributions without modifying the standard classification algorithm (Barua et al. 2012; Chawla et al. 2002; Han et al. 2005; Shahee and Ananthakumar 2018a, b).

Apart from class imbalance, certain data intrinsic characteristics like overlapping between the classes, lack of density and information in the training data, impact of noisy observations/outliers, presence of small disjuncts also worsen the performance of the classifiers (Alshomrani et al. 2015; Japkowicz and Stephen 2002; Jo and Japkowicz 2004; Prati et al. 2004; López et al. 2013). In many real-world applications, data exhibits class imbalance problem along with some data intrinsic characteristics (Tang et al. 2010).

In this paper, we propose a new method for binary class data that can handle class imbalance and class overlapping simultaneously along with noisy / outlier observations. Introduction of class overlap along with class imbalance adds additional challenges to the classification task. Batista et al. (2005) conducted an experiment on synthetic data and found that performance degradation of the classifier is not solely due to class imbalance but is also related to the degree of overlap between the classes. In literature, various methods have been proposed but most of the methods deal with class imbalance and class overlapping sequentially. The proposed method addresses the problem of class overlap in class imbalance using a very different approach of assigning different weights to the observations before training the ANN.

Some of the major contributions of this paper are summarized as follows:

  • We propose an overlap sensitive neural network where the loss function of the network varies with respect to the weights of the observations.

  • The weights of the observations are computed by taking into account the location of observations in the feature space.

  • The presence of noisy/outlier observations and the imbalance in class distributions are also given due consideration while computing the weights.

  • To demonstrate the effectiveness of the proposed method, we evaluate its performance on 12 simulated data with different scenarios and 23 real-world data sets and compare with the other methods. The results clearly show that the proposed method outperforms the other methods.

The paper is organised as follows. Related work is presented in Sect. 2, followed by the details of the proposed method in Sect. 3. Evaluation metrics and parameter settings are given in Sect. 4. Analyses of simulated and real-life data sets are presented in Sects. 5 and 6 respectively. Section 7 discusses the proposed method, and the conclusion is presented in Sect. 8.

2 Related work

Since our focus is on the challenges of handling data with class overlap in the presence of class imbalance for a neural network classifier, we review the relevant literature. As shown in Fig. 1, overlapping data enclose an ambiguous region in the feature space where the prior probabilities of the classes are roughly equal (Das et al. 2013). Generally, overlapping is caused by a lack of features that differentiate the classes. When the data set has overlapping regions, traditional classifiers are not able to find a feasible solution for classification (Xiong et al. 2010). In other words, the overlapping nature of a data set makes it difficult to identify a class boundary that perfectly separates the classes (Das et al. 2013). Finding a solution for classes with both overlap and imbalance is more difficult than solving the problems of imbalanced and overlapping classes independently.

In general, a kernel function is used to solve the class overlap problems (Das et al. 2013; Qu et al. 2011). A kernel function transforms the data from lower dimensional space to higher dimensional space, maximising the chance of identifying a linear separator in higher dimension feature space. However, class overlap can still exist in the higher dimensional feature space (Lee and Kim 2018).

Fig. 1 Left: data without class overlap. Right: data with class overlap

Xiong et al. (2010) introduced an approach in which data located in the overlapping region are treated differently from data in the non-overlapping region. Support Vector Data Description (SVDD) is used to find overlapping regions, followed by three overlapping-class modeling schemes, namely discarding, merging and separating. In the discarding scheme, models are trained on data that lie in the non-overlapping region, while data in the overlapping region are discarded. In the merging scheme, data in the overlapping region are given a new class label “overlapping” and two models are trained. The first model is trained by considering the overlapping region as a new class and the second model is trained only on data in the overlapping region. On test data, if the first model predicts the overlapping class, then the second model is used to determine the original class. In the separating scheme, two models are also trained, but the first model is trained only on the overlapping region and the second model only on the non-overlapping region. The findings of the paper suggest that the separating scheme is the best of the three. However, in all three cases, models are trained multiple times, which decreases computational efficiency.

Tang et al. (2010) used a probabilistic neural network (PNN) to divide the feature space into overlapping and non-overlapping regions. For deciding the overlapping region, two considerations are taken into account. First, the overlapping region should be large enough to accommodate most of the potentially misclassified observations to ensure that the classification of observations lying in the non-overlapping region is highly accurate. Second, the overlapping region should not be too wide to include too many patterns. For classification of the test set, if an observation falls in the non-overlapping region, then it is classified based on the highest posterior probability. If the test observation falls in the overlapping region, a new method is suggested for classification.

Das et al. (2013) proposed ClusBUS (Clustering-based undersampling technique) for handling class imbalance along with class overlap. Their method identifies the clusters present in the data set without considering the class labels. It then computes, for each cluster, the ratio (r) of the number of minority class observations to the size of the cluster. Based on the computed r value and an empirically determined threshold \(\tau \), it removes all of the majority class observations and retains the minority class observations. In each cluster, a vacuum is created around the minority class observations, which helps classifiers learn the decision boundary efficiently. However, removal of majority class observations could result in loss of information.

Tang and Gao (2007) proposed a multi-model classifier (DR-SVM) which combines SVM and KNN under a rough set technique. KNN is used to identify the boundary data, and the remaining data in each class are called positive region patterns. Two SVM classifiers are trained in DR-SVM and a pair of separating hyperplanes is obtained. Test observations falling in the overlapping region are classified by the KNN algorithm.

Lee and Kim (2018) proposed an overlap-sensitive margin (OSM) classifier that separates the data space into soft and hard-overlap regions using the modified fuzzy support vectors machine algorithm. Test set observations falling in soft-overlap region are classified based on decision boundary of the OSM classifier and those observations falling in hard-overlap region are classified using KNN algorithm with \(k = 1\). The key point of this method is that each observation has different cost of misclassification. However, this method results in significant decrease in the classification of majority class observations when data is highly imbalanced. Furthermore, this method does not take into account the presence of outliers or noisy examples in the data set.

Lin et al. (2017) proposed a focal loss function for handling class imbalance. The loss function modulates the cross-entropy loss in such a way that it focuses the learning on hard-to-learn examples and down-weights the contribution of the numerous easy examples. However, the loss function is not balanced between the classes. Cui et al. (2019) modified different loss functions, especially the focal loss function, to make them class-balanced. The class-balanced loss is a re-weighting loss function in which the weight is inversely proportional to the effective number of samples. Data overlap is considered in quantifying the effective number of samples. The class-balanced loss achieves significant performance gains on long-tailed data sets.
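To make the two re-weighting ideas above concrete, the following minimal sketch evaluates the focal loss of a single prediction and the class-balanced weight of a class with a given number of samples. The function names and the default values of `gamma` and `beta` are illustrative assumptions, and the optional class-weighting factor of the focal loss is omitted.

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Focal loss of one prediction: -(1 - p_t)^gamma * log(p_t).

    p is the predicted probability of class 1 and y the true label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def class_balanced_weight(n_samples, beta=0.999):
    """Inverse of the effective number of samples: (1 - beta) / (1 - beta^n)."""
    return (1.0 - beta) / (1.0 - beta ** n_samples)

# An easy example (p_t = 0.95) is down-weighted far more than a hard one (p_t = 0.30).
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
cross_entropy_easy = -math.log(0.95)

# A class with few samples receives a larger weight than a class with many.
w_rare = class_balanced_weight(10)
w_common = class_balanced_weight(1000)
```

With `gamma = 0` the focal loss reduces to the ordinary cross-entropy, which is a convenient sanity check.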

From the above literature, we observe that every method has its own limitation and further, most of the studies have considered the problems of class imbalance and overlapping sequentially. In this paper, we propose an overlap sensitive neural network that takes into account class overlap and class imbalance simultaneously. Our algorithm not only increases the performance of the classifiers on the minority class but it also ensures that the performance on the majority class is not compromised. In addition, the proposed method also considers the presence of noisy examples or outliers in the data set.

3 The proposed method

From the literature, it can be seen that one way of handling class imbalance is to assign a higher weight to the minority class than to the majority class. Motivated by this notion, rather than assigning a weight to an entire class, we assign different weights to the individual observations in our proposed method to handle class imbalance and class overlap simultaneously. The weight assigned to each observation acts as the cost of misclassifying that observation. This section discusses our proposed method by describing each of its components in the following subsections.

3.1 Neural network training

In an ANN, the input is fed through the input layer, the sigmoid activation function is applied at each neuron, and the result is finally transferred to the output layer. Such a network is also called a feedforward network and is shown in Fig. 2.

Fig. 2 Feed-forward neural network architecture

In forward propagation, the output of the jth hidden unit in the first hidden layer is obtained as

$$\begin{aligned} a_j^{(2)} = \sum _{i=1}^n \theta _{ji}^{(1)} x_i + \theta _{j0}^{(1)} \end{aligned}$$
(1)

where the superscript indicates the layer it belongs to. Here, \(\theta _{ji}^{(1)}\) denotes the weight parameter connecting ith input and jth neuron and \(\theta _{j0}^{(1)}\) denotes the bias for hidden unit j. Now \(a_j\) is transformed using the sigmoid activation function h(.) (McClelland et al. 1988) resulting in

$$\begin{aligned} z_j = h(a_j) \end{aligned}$$
(2)

where

$$\begin{aligned} h(x) = 1/ (1 + \exp (-x)) \end{aligned}$$
(3)

A similar process is followed in the next layer, with \(z_j\) as inputs, to finally obtain the output of the network.
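The forward pass of Eqs. (1) to (3) can be sketched for a network with one hidden layer and a single output unit as follows. The function and argument names are hypothetical, and the parameters are stored in plain Python lists purely for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid h(x) = 1 / (1 + exp(-x)), Eq. (3)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, theta1, bias1, theta2, bias2):
    """One-hidden-layer forward pass for a single input vector x.

    theta1[j][i] connects input i to hidden unit j and bias1[j] is its bias (Eq. 1);
    theta2[j] connects hidden unit j to the single output unit.
    """
    # Pre-activations of the hidden layer, Eq. (1).
    a = [sum(t * xi for t, xi in zip(row, x)) + b for row, b in zip(theta1, bias1)]
    # Hidden activations, Eq. (2).
    z = [sigmoid(aj) for aj in a]
    # The same two steps are repeated for the output unit.
    a_out = sum(t * zj for t, zj in zip(theta2, z)) + bias2
    return sigmoid(a_out)

# With all parameters set to zero, every pre-activation is 0 and the output is h(0).
out = forward([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [0.0, 0.0], 0.0)
```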

For determining the weights \(\theta \), ANN in general considers the cross-entropy error function for a classification problem as it leads to faster training and improved generalization (Simard et al. 2003). In the current study, each observation is assigned different importance depending on its location in the feature space and accordingly, the error function considered is

$$\begin{aligned} E(\theta ) = -\sum _{n = 1}^N \{y_{n} \log z_{n} + (1-y_{n})\log (1-z_{n})\}*w_n \end{aligned}$$
(4)

where \(z_{n}\) denotes the output of the network for the input \(x_n\), \(y_{n}\) is the actual class label and \(w_n\) denotes the weight of the nth observation. Here, the gradient of \(E(\theta )\) w.r.t \(\theta \) is computed using backpropagation iteratively.
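As a minimal illustration of Eq. (4), the sketch below evaluates the weighted cross-entropy for a handful of observations. The function name and the clipping constant `eps` are assumptions introduced here for numerical safety; they are not part of Eq. (4).

```python
import math

def weighted_cross_entropy(y, z, w, eps=1e-12):
    """Eq. (4): E = -sum_n w_n * [y_n log z_n + (1 - y_n) log(1 - z_n)].

    eps clips the network outputs away from 0 and 1 (an added assumption).
    """
    total = 0.0
    for yn, zn, wn in zip(y, z, w):
        zn = min(max(zn, eps), 1.0 - eps)
        total -= wn * (yn * math.log(zn) + (1 - yn) * math.log(1.0 - zn))
    return total

y = [1, 0, 1]          # true labels
z = [0.9, 0.2, 0.4]    # network outputs
uniform = weighted_cross_entropy(y, z, [1.0, 1.0, 1.0])
# Lowering the weight of the poorly fitted third observation lowers its share of the error.
down_weighted = weighted_cross_entropy(y, z, [1.0, 1.0, 0.1])
```

With all weights equal to one, the expression reduces to the usual cross-entropy, so the observation weights only change how much each observation contributes to the error.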

As \(E(\theta )\) depends on the weight \(\theta _{ji}\) through the summed input \(a_j\), we apply the chain rule for partial derivative given by

$$\begin{aligned} \frac{\partial E_n (\theta )}{\partial \theta _{ji}} = \frac{\partial E_n(\theta )}{\partial a_j} \frac{\partial a_j}{\partial \theta _{ji}}w_n \end{aligned}$$
(5)

Let us denote

$$\begin{aligned} \delta _j \equiv \frac{\partial E_n(\theta )}{\partial a_j} \end{aligned}$$
(6)

Hence, Eq. (5) becomes

$$\begin{aligned} \frac{\partial E_n (\theta )}{\partial \theta _{ji}} = \delta _j \frac{\partial a_j}{\partial \theta _{ji}}w_n \end{aligned}$$
(7)

where \(\frac{\partial a_j}{\partial \theta _{ji}} = x_i\) for layer 1 and \(z_i\) for other layers.

This implies that the required derivatives are computed by multiplying the weight \(w_n\) of each observation, by the product of the z value at the input end of the weight and the \(\delta \) value of the following layer. To obtain the derivatives in (7), \(\delta \)’s are computed next.

For the output neuron,

$$\begin{aligned} \delta _k = z_k - y_k \end{aligned}$$
(8)

where \(z_k\) is the value predicted by the network at output neuron k for the observation under consideration and \(y_k\) is the corresponding true class label.

Now, the \(\delta \) values for hidden layer neurons are obtained by propagating the \(\delta \)’s recursively from the next higher layer using

$$\begin{aligned} \delta _j = h^{'}(a_j) \sum _k \theta _{kj} \delta _k \end{aligned}$$
(9)

Substituting these \(\delta \)’s in Eq. (7), the required derivatives are computed.
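The weighted backpropagation step can be checked numerically. The sketch below computes the gradients of Eqs. (7) to (9) for one observation in a network with one hidden layer and a single output unit, and compares one of them with a central finite difference of the weighted error. Biases are omitted for brevity, and all names and numerical values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, theta1, theta2):
    """Return the hidden activations z and the network output z_out."""
    z = [sigmoid(sum(t * xi for t, xi in zip(row, x))) for row in theta1]
    z_out = sigmoid(sum(t * zj for t, zj in zip(theta2, z)))
    return z, z_out

def loss(x, y, w, theta1, theta2):
    """Weighted cross-entropy of one observation (one term of Eq. 4)."""
    _, z_out = forward(x, theta1, theta2)
    return -w * (y * math.log(z_out) + (1 - y) * math.log(1.0 - z_out))

def gradients(x, y, w, theta1, theta2):
    """Weighted gradients obtained from Eqs. (7)-(9)."""
    z, z_out = forward(x, theta1, theta2)
    delta_out = z_out - y                                    # Eq. (8)
    grad2 = [w * delta_out * zj for zj in z]                 # Eq. (7), output layer
    # Eq. (9) with h'(a_j) = z_j (1 - z_j) for the sigmoid.
    delta_hidden = [zj * (1.0 - zj) * t * delta_out for zj, t in zip(z, theta2)]
    grad1 = [[w * dj * xi for xi in x] for dj in delta_hidden]  # Eq. (7), layer 1
    return grad1, grad2

x, y, w = [0.5, -1.0], 1, 0.4
theta1 = [[0.1, -0.2], [0.3, 0.4]]
theta2 = [0.5, -0.6]
grad1, grad2 = gradients(x, y, w, theta1, theta2)

# Central finite difference on theta1[0][1] as an independent check.
h = 1e-6
plus = [row[:] for row in theta1]
plus[0][1] += h
minus = [row[:] for row in theta1]
minus[0][1] -= h
numeric = (loss(x, y, w, plus, theta2) - loss(x, y, w, minus, theta2)) / (2 * h)
```

The analytic and numerical values should agree closely, which confirms that the observation weight enters the gradient only as a multiplicative factor.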

3.2 Computation of observation weights

The main motivation for assigning weights to the observations is to increase the performance of the classifier on the minority class without losing performance on the majority class. The computation of the weights takes into account class overlapping, class imbalance and noisy/outlier observations. For this, the data are divided into three parts, namely safe zone, overlapping, and outlier (or noisy) observations. An observation is called a safe zone observation if it is surrounded only by observations of its own class. Overlapping observations are those that have some observations of the other class in their neighbourhood. When an observation is surrounded entirely by observations of the other class, it is referred to as an outlier.

3.2.1 Class overlapping

For handling overlapping between the classes, we assign weights to the observations according to their level of overlap. Higher weights are assigned to the observations that are less overlapping and lower weights to those that are highly overlapping. To incorporate overlapping between the classes, the propensity score of each observation is defined using the K-nearest-neighbour (K-NN) algorithm. The value of K is set to 5, as in other well-known methods in the class imbalance domain (Chawla et al. 2002; He et al. 2008).

$$\begin{aligned} P = NN /5 \end{aligned}$$
(10)

where P is the propensity score and NN is the number of examples from the same class among the 5 nearest neighbours of the observation. \(P = 0\) means that the observation is located inside the other class; in other words, it is an outlier observation. \(P = 1\) refers to a safe zone observation, surrounded by observations of the same class. An observation with \(0< P < 1\) is referred to as an overlapping observation. Hence, overlapping between classes is incorporated through the computation of the propensity score.

To accommodate different levels of overlapping of outliers in our method, we consider a parameter C in the range [0.0, 0.20]. With \(K = 5\), the smallest propensity score of an overlapping observation is \(1/5 = 0.20\), so this range ensures that the weight assigned to an outlier cannot exceed that assigned to an overlapping observation. Thus, each observation is assigned a propensity score, with outliers being assigned the score C.
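A brute-force computation of the propensity score of Eq. (10) can be sketched as follows. The function name, the toy data and the value of C are hypothetical, and Euclidean distance is assumed for the neighbour search.

```python
import math

def propensity_scores(X, y, k=5, C=0.1):
    """Return P = NN / k for every observation (Eq. 10); outliers (P = 0) receive C."""
    scores = []
    for i, xi in enumerate(X):
        # Distances to all other observations, paired with their class labels.
        dists = sorted((math.dist(xi, xj), y[j]) for j, xj in enumerate(X) if j != i)
        same = sum(1 for _, label in dists[:k] if label == y[i])
        scores.append(C if same == 0 else same / k)
    return scores

# Toy data: a class-0 cluster, one class-1 point inside it, and a distant class-1 cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.2), (0.2, 0.5),
     (0.5, 0.5),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.2), (10.2, 10.5)]
y = [0] * 6 + [1] + [1] * 6
scores = propensity_scores(X, y)
```

In this toy example, the class-1 point at (0.5, 0.5) has no neighbour of its own class and therefore receives the score C, the class-0 point at (0, 0) has one class-1 neighbour and receives 0.8, and the points of the distant cluster receive 1.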

3.2.2 Handling class imbalance

After dealing with class overlapping by using the propensity score, we now have to deal with class imbalance. One of the techniques for handling class imbalance is to preprocess the data so as to diminish the effect of class imbalance, either by increasing the number of minority class observations or by decreasing the number of majority class observations. In this study, we diminish the effect of class imbalance by making the total weight of the minority class equal to the total weight of the majority class. To do this, we define the overlapping imbalance ratio (OIR), calculated as the number of overlapping observations in the minority class (nn1) divided by the number of overlapping observations in the majority class (nn0).

$$\begin{aligned} OIR = nn1 / nn0 \end{aligned}$$
(11)

Then we multiply the weights of overlapping majority class observations by OIR. Further, to make the total weights of majority class and minority class equal, the minority class examples are multiplied by a factor WGT where WGT is defined as

$$\begin{aligned} WGT = \text {sum of majority class weights} / \text {sum of minority class weights} \end{aligned}$$
(12)

In this way, the sums of the weights of the two classes are balanced. This also ensures that the weights of observations in the safe zone are higher than the weights of observations in the overlapping region, which in turn are higher than the weights of the outliers.
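The two balancing steps of Eqs. (11) and (12) can be sketched as follows. The order of operations follows the description above; the function name, the fallback used when no majority observation is overlapping, and the toy propensity scores are assumptions made for illustration.

```python
def balance_weights(P, y, C=0.1, minority=1):
    """Turn raw propensity scores P into class-balanced observation weights.

    P holds the raw scores of Eq. (10), with 0 marking an outlier.
    Returns the weights together with OIR (Eq. 11) and WGT (Eq. 12).
    """
    weights = [C if p == 0 else p for p in P]
    overlap = [0 < p < 1 for p in P]
    nn1 = sum(1 for o, c in zip(overlap, y) if o and c == minority)
    nn0 = sum(1 for o, c in zip(overlap, y) if o and c != minority)
    oir = nn1 / nn0 if nn0 else 1.0  # fallback value is an assumption
    # Scale the overlapping majority class observations by OIR.
    weights = [w * oir if o and c != minority else w
               for w, o, c in zip(weights, overlap, y)]
    majority_sum = sum(w for w, c in zip(weights, y) if c != minority)
    minority_sum = sum(w for w, c in zip(weights, y) if c == minority)
    wgt = majority_sum / minority_sum
    # Scale every minority class observation by WGT.
    weights = [w * wgt if c == minority else w for w, c in zip(weights, y)]
    return weights, oir, wgt

P = [1.0, 1.0, 0.8, 0.6, 0.6, 1.0, 0.4, 0.0]   # hypothetical propensity scores
y = [0, 0, 0, 0, 0, 1, 1, 1]
weights, oir, wgt = balance_weights(P, y)
total_majority = sum(w for w, c in zip(weights, y) if c == 0)
total_minority = sum(w for w, c in zip(weights, y) if c == 1)
```

After both steps, the two class totals coincide, while within each class the ordering of safe zone, overlapping and outlier weights is preserved.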

Fig. 3 Different types of outliers: A is the least overlapping, followed by B and D, and then by E

3.2.3 Outlier adjustment

While handling outliers, all such observations were assigned the same score C, as indicated in Sect. 3.2.1. Further, the weight adjustment for class imbalance changes the weights of the outliers uniformly. However, these outliers need not all be of the same kind, as some may be easier to classify than others. This points to the need to weigh the outliers according to their level of overlapping. To illustrate this aspect, Fig. 3 shows four outliers A, B, D and E of the minority class. Here, A is close to the minority class and is relatively easy to classify compared to B and D; hence the level of overlapping is the least for A, followed by B and D, and then by E. Thus, the assigned weights should decrease as the extent of overlapping increases. To accommodate this, we compute the distance of each outlier from the centroid of its class using Eq. (13).

$$\begin{aligned} d = \left( ||X_c - X_i||_2^2 \right) ^{1/2} \end{aligned}$$
(13)

where d is the distance between the centroid \(X_c\) and the outlier \(X_i\). The outliers are then weighed with respect to this distance; accordingly, we choose a monotonically decreasing function of the distance, given by (14).

$$\begin{aligned} w_d = 2/(1 + \exp (d)) \end{aligned}$$
(14)

The weights of the outliers are adjusted by multiplying them with this quantity \(w_d\). As a result, an outlier that is located far from its centroid is assigned less weight than an outlier that is closer to its centroid. This reduces the weights of the outliers, and the difference is distributed proportionately among the other observations. The proposed method is summarized in Algorithm 1.
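The distance-based factor of Eq. (14) can be illustrated as follows, assuming that d is the Euclidean distance between an outlier and the centroid of its own class. The function names and the coordinates are hypothetical, and the redistribution of the removed weight among the other observations is not covered by this sketch.

```python
import math

def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def outlier_factor(x, center):
    """w_d = 2 / (1 + exp(d)), Eq. (14), with d the Euclidean distance to the centroid."""
    d = math.dist(x, center)
    return 2.0 / (1.0 + math.exp(d))

# Hypothetical minority class observations and two outliers of that class.
minority = [(3.0, 3.0), (3.5, 2.5), (2.5, 3.5), (3.0, 3.4), (3.0, 2.6)]
center = centroid(minority)
factor_near = outlier_factor((2.0, 2.0), center)   # outlier close to its class
factor_far = outlier_factor((0.0, 0.0), center)    # outlier deep inside the other class
```

The factor equals 1 at zero distance and decays towards 0 as the distance grows, so multiplying the weight of an outlier by it can only reduce that weight.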

figure a

3.3 Computation complexity

We analyze the computational complexity of computing the propensity score. The propensity score of an observation is based on the number of observations of the same class among its K nearest neighbours. Let the data set have N examples in an n-dimensional feature space. The time complexity of computing the distance between two points in \(R^n\) is \({\mathcal {O}}(n)\). Since \((N-1)\) distances are computed for each example, the corresponding time complexity is \({\mathcal {O}}((N - 1)n)\). These distances are sorted using Radix sort. The time complexity of sorting the \((N - 1)\) distances using Radix sort is \({\mathcal {O}}((N - 1 + b) \log _b D)\), where b is the radix and D is the largest key value, and selecting the first k observations is \({\mathcal {O}}(1)\). Hence, the total time taken for one example is \({\mathcal {O}}((N - 1)n) + {\mathcal {O}}((N - 1 + b) \log _b D)\). As we have N observations in total, the total time complexity is \({\mathcal {O}}(((N - 1)n + (N - 1 + b) \log _b D) N)\), which is approximately \({\mathcal {O}}(N^2n)\). The present study uses this propensity score, though it can become expensive for large data sets with a large number of features. Further, if categorical variables are present in the data set, one-hot encoding is generally used for their representation. This increases the number of features, which further increases the time complexity. An appropriate distance measure needs to be used in the calculation of the propensity score to handle non-continuous variables, which we intend to study later.

4 Evaluation metrics and parameter settings

This section discusses the evaluation metrics used to evaluate the performance of the proposed method. The parameter settings used in the network are also presented.

4.1 Evaluation metrics

The proposed method is evaluated using metrics that are commonly used in the literature (He et al. 2008; He and Garcia 2008; Tharwat 2018) and are derived from the confusion matrix in Table 1. In this confusion matrix, the rows denote the number of examples of each true class and the columns denote the number of examples assigned to each class by the classifier.

Some of the well known metrics used for imbalanced data sets are precision, recall, F-measure and G-mean (He and Garcia 2008). These metrics are defined as

$$\begin{aligned} Precision= & {} \frac{TP}{TP + FP} \end{aligned}$$
(15)
$$\begin{aligned} Recall= & {} \frac{TP}{TP + FN} \end{aligned}$$
(16)
$$\begin{aligned} F-Measure= & {} \frac{(1+\beta ^2)Recall*Precision}{\beta ^2*Recall+Precision} \end{aligned}$$
(17)

Here \(\beta \) is a non-negative parameter that controls the relative influence of precision and recall. With \(\beta = 0\), the F-Measure is the same as Recall, and when \(\beta \rightarrow \infty \), it tends to Precision. In this study, we set \(\beta = 1\), giving equal importance to precision and recall.

$$\begin{aligned} G-Mean = \sqrt{\frac{TP}{TP+FN}\frac{TN}{TN+FP}} \end{aligned}$$
(18)

G-Mean measures the performance by combining the Recall of positive and negative classes (Guo and Viktor 2004b).
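For a concrete reading of Eqs. (15) to (18) with \(\beta = 1\), the sketch below computes the metrics from the four cells of a confusion matrix. The counts are made up for illustration and the function names are hypothetical.

```python
import math

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """F-Measure with beta = 1: harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def g_mean(tp, fn, tn, fp):
    """Geometric mean of the recalls of the positive and negative classes, Eq. (18)."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

# Hypothetical imbalanced test set: 10 minority (positive) and 90 majority observations.
tp, fn, fp, tn = 8, 2, 4, 86
accuracy = (tp + tn) / (tp + fn + fp + tn)
f1_value = f1(tp, fp, fn)
g_value = g_mean(tp, fn, tn, fp)
```

In this example the accuracy is 0.94, whereas the F-Measure of the minority class is about 0.73, which illustrates why accuracy alone is not informative for imbalanced data.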

Another widely used graphical representation in the imbalanced domain is the Receiver Operating Characteristic (ROC) curve (Provost et al. 1997). This representation of the performance of the classifier plots TP rates on the Y-axis and FP rates on the X-axis. The TP rates and FP rates are defined as

$$\begin{aligned} \text {TP rate}= & {} \frac{TP}{TP+FN} \end{aligned}$$
(19)
$$\begin{aligned} \text {FP rate}= & {} \frac{FP}{FP+TN} \end{aligned}$$
(20)

A quantitative representation of ROC curve is the area under this curve and is called AUC (Bradley 1997; Huang and Ling 2005).
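The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise formulation; the function name, scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]   # classifier scores for the positive class
labels = [1, 1, 0, 1, 0, 0]
auc_value = auc(scores, labels)
```

Here eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of observations and is meant only as an illustration.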

Table 1 Confusion matrix

4.2 Parameter settings

In this study, the neural network has one hidden layer, and the number of neurons it contains is (No. of features + classes)/2, as considered in the literature (Guo and Viktor 2004a). The number of input neurons is equal to the number of features of the data set. In addition, batch normalization is used along with the ReLU activation function in the hidden layer and the sigmoid activation function in the output layer. The binary cross-entropy loss function is optimized using the Adam stochastic optimizer with a learning rate of 0.01, and the number of epochs is set to 100. The network is built as a PyTorch sequential model. The degree of overlap parameter C takes values in \(\{0.20, 0.15, 0.10, 0.05, 0.00\}\), as explained in the proposed method section.

As the proposed method assigns weights to the observations before training the network, it is a kind of preprocessing technique. Hence, we compare our proposed method with certain well-known preprocessing-based techniques, with default parameters, for handling class imbalance along with class overlapping, namely SMOTE (Chawla et al. 2002), ENN (Wilson 1972), SMOTE + ENN (Batista et al. 2004), Tomek links (Tomek 1976) and SMOTE + Tomek (Batista et al. 2004). We evaluate the performance of the proposed method using five-fold stratified cross validation and compare it with the other existing methods on various metrics.

5 Simulation studies

This section considers the evaluation of the proposed method on simulated data sets and compares its performance with some of the well known preprocessing based methods that are used for handling class imbalance along with class overlapping.

5.1 Simulation setup

In this section, we simulate twelve binary class data sets that can be divided into three groups (A, B and C) with different levels of class overlap, each comprising four data sets. The data sets of Group A are simulated using a bivariate Gaussian distribution, with the majority class having mean vector (0,0) and the minority class having mean vector (3,3), with common covariance matrix \(I_{2 \times 2}\). Group B data sets are simulated in a similar way, except that the mean vector of the minority class is shifted towards the majority class from (3,3) to (1,1) to increase the class overlap. To further increase the overlap, the data sets in Group C are generated in such a way that the mean vector of the minority class is shifted from (1,1) to (0.8, 0.8). Figure 4 shows the structure of the three groups with different levels of overlapping. For each group, we create four data sets with different class imbalance ratios, approximately equal to 1:2, 1:5, 1:10 and 1:20. The data sets thus generated are listed in Table 2 along with the number of observations in each class.
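A data set of this kind can be generated with the Python standard library, as sketched below. Because the covariance matrix is the identity, each coordinate is drawn independently with unit variance. The function name, the class sizes and the seed are hypothetical and do not reproduce the sizes listed in Table 2.

```python
import random

def simulate(n_major, n_minor, minority_mean, seed=0):
    """Draw a two-class bivariate Gaussian data set with identity covariance.

    The majority class (label 0) is centred at (0, 0) and the minority class
    (label 1) at (minority_mean, minority_mean).
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_major):
        X.append((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
        y.append(0)
    for _ in range(n_minor):
        X.append((rng.gauss(minority_mean, 1.0), rng.gauss(minority_mean, 1.0)))
        y.append(1)
    return X, y

# Group A-like setting with an imbalance ratio of roughly 1:10 (sizes are assumptions).
X_a, y_a = simulate(1000, 100, 3.0)
# Group C-like setting: the minority mean moves to (0.8, 0.8), increasing the overlap.
X_c, y_c = simulate(1000, 100, 0.8)
minority_mean_a = sum(x[0] for x, label in zip(X_a, y_a) if label == 1) / 100
```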

Fig. 4 Simulated data sets having different levels of overlapping

Table 2 Summary of the simulated data sets

5.2 Results

Tables 3, 4 and 5 present the results of the simulation studies for Groups A, B and C respectively. Tables 4 and 5 clearly indicate better performance of the proposed method for Groups B and C in terms of the F-measure with \(\beta = 1\) (F1), G-Mean and AUC for all imbalance ratios, except for SimDataC.3, where SMOTE_ENN performs slightly better than the proposed method in terms of G-Mean. As the imbalance ratio increases, the performance gap between the proposed method and the other methods becomes wider; it is particularly prominent for the Group C data set SimDataC.4, where the imbalance ratio is the highest. However, for the Group A data sets, where the overlap is insignificant, SMOTE and SMOTE_TOMEK perform slightly better than the proposed method in terms of AUC and G-Mean, though the proposed method still performs much better in terms of F1.

Table 3 F-measure, G-mean & AUC values for Group A data sets
Table 4 F-measure, G-mean & AUC values for Group B data sets
Table 5 F-measure, G-mean & AUC values for Group C data sets

6 Experiments on real-life data sets

We evaluate the proposed method on 23 real-life data sets and compare its performance with the performance of preprocessing based methods which were used for the simulation studies.

Fifteen of the 23 data sets are chosen from the KEEL data set repository (Alcalá-Fdez et al. 2011); the maximum imbalance ratio among these data sets is 58.28 and the maximum number of observations is 1484. As a large-scale data set, the breast cancer data set from the Knowledge Discovery and Data Mining Cup is used, which contains 102,294 examples with an imbalance ratio of 163.20. Apart from these, three data sets from the corporate bankruptcy domain are considered, namely the USA, Japan and Polish bankruptcy data (Zhou 2013; Zięba et al. 2016).

The USA data set contains observations from non-financial industries from 1981 to 2009, with the financial status (Non-bankrupt or Bankrupt) as the class label. A bankrupt company is defined as one whose reason for deletion is marked as “bankruptcy” or “liquidation” in the original Compustat North America data set. As suggested by Zhou (2013), we use 10 explanatory variables to predict the financial status. These explanatory variables are: net income/total assets (NI/TA), current ratio (CR), retained earnings/total assets (RE/TA), working capital/total assets (WC/TA), EBIT/total assets (EBIT/TA), sales/total assets (S/TA), cash/total assets (C/TA), current assets/total assets (CA/TA), stockholders' equity/total debt (SHE/TD) and cash/current liabilities (C/CL). The numbers of bankrupt and non-bankrupt firms between 1981 and 2009 are 918 and 85,211 respectively, giving an imbalance ratio of 92.82. The Japan bankruptcy data set consists of only non-financial firms and indicates whether they were bankrupt or non-bankrupt during the period 1989 to 2009. As before, we use the same 10 explanatory variables. This data set has 59 bankrupt observations and 36,578 non-bankrupt observations, giving an imbalance ratio of 619.97.

For the data on Polish companies in the manufacturing sector, the period considered is 2007 to 2013 for bankrupt companies and 2000 to 2012 for operating companies. The data set is divided into 5 parts depending on the forecasting period. The 1st year data contain financial rates from the 1st year of the forecasting period, and the corresponding class label indicates the bankruptcy status after 5 years. The 2nd year data contain financial rates from the 2nd year of the forecasting period, and the class label indicates the bankruptcy status after 4 years. Similarly, the 3rd year, 4th year and 5th year data sets contain financial rates from the 3rd, 4th and 5th years of the forecasting period, and the class labels indicate the bankruptcy status after 3, 2 and 1 year respectively. The characteristics of these data sets are listed in Table 6.

Table 6 Summary of the real-life data sets

Five-fold stratified cross-validation is used to compare the performance of the proposed method with the existing methods, except for the Japan and USA data sets. Following Zhou (2013), for the Japan and USA data sets, models are trained on observations from 1981 to 2001 and tested on observations from 2002 to 2009. The results of the analysis are presented in Tables 7, 8, 9, 10 and 11 (best results are highlighted in boldface).
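The two evaluation protocols described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `stratified_folds` and `temporal_split` are hypothetical helper names, the toy labels and years are invented, and the fold assignment uses a simple shuffle-and-deal scheme within each class.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k
    return fold_of


def temporal_split(years, last_train_year=2001):
    """Train on observations up to last_train_year, test on later ones."""
    train = [i for i, y in enumerate(years) if y <= last_train_year]
    test = [i for i, y in enumerate(years) if y > last_train_year]
    return train, test


# Toy data (illustrative only): 10 minority and 90 majority observations.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)
fold_sizes = []
for f in range(5):
    members = [i for i, g in enumerate(folds) if g == f]
    fold_sizes.append((len(members), sum(labels[i] for i in members)))
print(fold_sizes)

# Year-based split as used for the Japan and USA data sets.
years = [1981, 1995, 2001, 2002, 2009]
train_idx, test_idx = temporal_split(years)
print(train_idx, test_idx)
```

Stratification matters here because, at imbalance ratios in the hundreds, a plain random split could leave a fold with almost no minority observations.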

Table 7 F-measure, G-mean & AUC values for real-life data sets
Table 8 F-measure, G-mean & AUC values for real-life data sets
Table 9 F-measure, G-mean & AUC values for real-life data sets
Table 10 F-measure, G-mean & AUC values for real-life data sets
Table 11 F-measure, G-mean & AUC values for real-life data sets

It can be observed that the proposed method performs well on the F-measure of the minority class, G-mean and AUC for different values of C. In terms of G-mean, the proposed method outperforms the other methods on all data sets except the Japan and yeast-2_vs_4 data sets, where SMOTE performs better than the other methods. Similarly, in terms of AUC, the proposed method outperforms all other methods on all data sets except Japan and yeast-2_vs_4, where SMOTE and SMOTE_ENN, respectively, perform better. Figure 5 shows the ROC graphs for the corporate bankruptcy data sets. These graphs show that the proposed method performs better on all data sets except the Japan data set. These observations can be verified from the AUC values in Tables 7, 8, 9, 10 and 11.
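For readers who want to recompute the reported measures, the sketch below derives the minority-class F-measure and the G-mean from confusion-matrix counts, assuming the usual definitions (F-measure as the harmonic mean of precision and recall, G-mean as the geometric mean of the true positive and true negative rates) with the minority class treated as positive. The function name and the counts are illustrative, not taken from the paper's tables.

```python
import math


def minority_metrics(tp, fn, fp, tn):
    """Return (F-measure, G-mean) with the minority class as the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    g_mean = math.sqrt(recall * specificity)
    return f_measure, g_mean


# Illustrative counts: 10 minority and 90 majority observations.
f, g = minority_metrics(tp=8, fn=2, fp=4, tn=86)
print(f"F-measure = {f:.3f}, G-mean = {g:.3f}")
```

Neither measure rewards a classifier that predicts only the majority class: with no true positives, recall is zero, so both the F-measure and the G-mean collapse to zero.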

To assess whether the proposed method shows a significant improvement over the existing methods, we conducted the Wilcoxon signed-rank test (Richardson 2010) on the F-measure of the minority and majority classes, G-mean and AUC. The null and alternative hypotheses are as follows:

  • \(H_0\): The median difference is zero

  • \(H_1\): The median difference is positive

The Wilcoxon signed-rank test ranks the absolute differences in performance between two classifiers across the data sets. If the null hypothesis is true, the sum of the ranks corresponding to positive differences \((W+)\) and that of negative differences \((W-)\) should be nearly equal. The null hypothesis is rejected in favor of the above alternative hypothesis only if the test statistic \(W = W-\) is sufficiently small. For 23 data sets, to reject the null hypothesis at the 0.05 significance level, the W value must be less than 73 (Richardson 2010). Table 12 presents the details of the Wilcoxon signed-rank test on the AUC values of the proposed method and SMOTE. As this table shows, \( W+ = 255\) and \(W- = 21\), so the W value is 21. Since \( W < 73\), the null hypothesis is rejected, indicating that the proposed method is superior to SMOTE in terms of AUC. Table 13 summarizes the \(W+\), \(W-\) and W values of the Wilcoxon signed-rank test when comparing the proposed method with the other methods on the F-measure of both classes, G-mean and AUC. The statistical tests indicate that the proposed method outperforms the other methods in terms of AUC, minority-class F-measure and G-mean.
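The rank sums used in this test can be computed as in the sketch below. It is a minimal illustration rather than the authors' implementation: zero differences are dropped and tied absolute differences receive their average rank (a common convention, assumed here), and the six paired scores are invented for demonstration. The final lines only check the reported values: with 23 non-zero differences the two rank sums must add up to 23 · 24 / 2 = 276, which 255 + 21 does.

```python
def signed_rank_sums(scores_a, scores_b):
    """Return (W+, W-) for paired scores of method A versus method B."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next absolute difference is tied.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        avg_rank = (start + end) / 2 + 1  # ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Illustrative AUC values on six data sets (not from the paper).
proposed = [0.91, 0.85, 0.78, 0.88, 0.95, 0.70]
baseline = [0.89, 0.86, 0.74, 0.80, 0.90, 0.64]
w_plus, w_minus = signed_rank_sums(proposed, baseline)
print(w_plus, w_minus)

# Consistency check on the values reported for the proposed method vs SMOTE.
reported_w_plus, reported_w_minus, n_datasets = 255, 21, 23
rank_total = n_datasets * (n_datasets + 1) // 2
reject_null = reported_w_minus < 73  # critical value quoted in the text
print(rank_total, reject_null)
```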

7 Discussion

In the literature, few preprocessing-based techniques exist for handling class imbalance and class overlap. Traditionally, SMOTE handles class imbalance by oversampling minority observations. However, oversampling is carried out without considering the location of the observations in the feature space, and oversampling in the overlapping region degrades the performance of the classifier. Further, oversampling increases the size of the training set, thus increasing the training time substantially. Edited Nearest Neighbours (ENN) and Tomek Links are clean-up techniques that remove overlapping observations. These techniques are combined with SMOTE to handle overlap, resulting in the techniques SMOTE + ENN and SMOTE + Tomek. However, these methods lose information because some of the overlapping observations are removed.

Fig. 5 Averaged ROC curves for corporate bankruptcy data sets

The proposed method takes an entirely different approach: it considers different costs of misclassification for different observations while training the ANN. These costs are incorporated by assigning each observation a weight that depends on its location in the feature space. Accordingly, the observations of a data set are divided into three groups: safe-zone, overlapping, and noisy or outlier observations. The extent of overlapping is quantified by computing the propensity score. The proposed method initially uses a parameter C to assign all the noisy observations a uniform weight less than or equal to the propensity score assigned to the overlapping observations. Next, the weights of all the observations in the minority class are proportionally increased so that their sum matches that of the majority class observations. Finally, the weights of the different types of outliers are adjusted depending on their distance from the centroid of the minority class. In this way, the resulting weights of safe-zone observations are higher than those of overlapping observations, which in turn are higher than those of noisy observations. Computing the weights in this way helps the ANN handle class imbalance, as the total weight of the minority class equals that of the majority class. In addition, because the weights of the overlapping observations decrease with the extent of overlapping, the true positive rate increases. Further, assigning the least weight to outliers minimizes the shift of the decision boundary towards the majority class, resulting in a reduction of the false positive rate. Thus, the proposed method handles class imbalance and class overlap simultaneously.
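The class-balancing step described above can be illustrated with a short sketch. It covers only the proportional rescaling of minority weights and an observation-weighted sum of squared errors; the propensity-score computation, the role of C and the outlier adjustment are not reproduced. The function names, the initial weights and the assumption that the weights enter the loss as multipliers on the squared errors are illustrative, not taken from the paper.

```python
def balance_class_weights(weights, labels, minority_label=1):
    """Scale minority weights so both classes carry the same total weight."""
    total_min = sum(w for w, y in zip(weights, labels) if y == minority_label)
    total_maj = sum(w for w, y in zip(weights, labels) if y != minority_label)
    if total_min == 0:
        raise ValueError("minority class has zero total weight")
    scale = total_maj / total_min
    return [w * scale if y == minority_label else w
            for w, y in zip(weights, labels)]


def weighted_sse(weights, targets, outputs):
    """Sum of squared errors with one weight per observation (assumed loss form)."""
    return sum(w * (t - o) ** 2 for w, t, o in zip(weights, targets, outputs))


# Hypothetical initial weights: safe-zone observations at 1.0, overlapping
# and noisy observations at smaller values.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
initial = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 0.25]
balanced = balance_class_weights(initial, labels)

minority_total = sum(w for w, y in zip(balanced, labels) if y == 1)
majority_total = sum(w for w, y in zip(balanced, labels) if y == 0)
print(balanced, minority_total, majority_total)

# Illustrative network outputs for the same eight observations.
outputs = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 0.5]
loss = weighted_sse(balanced, labels, outputs)
print(loss)
```

Because the rescaling is proportional, the ordering of weights within the minority class is preserved: a minority observation in the safe zone still outweighs an overlapping or noisy one after balancing.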

Table 12 Illustration of significance test details for AUC between the proposed method and SMOTE
Table 13 Summary of significance test results

8 Conclusion

In this study, we have proposed an overlap-sensitive neural network for handling class imbalance together with class overlap and the presence of noisy observations. The method incorporates different costs of misclassification by computing a weight for each observation that depends on its location in the feature space. Twelve simulated data sets that vary with respect to class imbalance and class overlap were analyzed, and the results show that the proposed method outperforms the other methods on several performance measures. Further, when tested on 23 publicly available data sets, the proposed method also shows superior performance on measures such as F-measure, G-mean and AUC. Thus, this approach to training the ANN handles the problems of class imbalance and class overlap effectively.