1 Introduction

The number of samples commonly differs from one class to another in classification problems. This problem, known as the imbalanced data set problem [1,2,3,4,5,6,7], arises in most real-world applications. The point is that most current inductive learning principles rely on a sum of squared errors that does not take priors into account, which generally results in a classification bias towards the majority class.

One possible approach to handle this problem is to adopt an alternative criterion to the overall learning error [4, 8, 9]. Another solution is data resampling, which indirectly modifies the selection probability of the patterns during the learning phase. According to Bayesian decision theory, changing the prior probabilities is analogous to setting a new decision boundary for a probability-based classifier [10]. Many data resampling techniques have been proposed in the literature, for example, the “Synthetic Minority Oversampling Technique” (SMOTE) [11], “Weighted Wilson’s Editing” (WWE) [12] and “Adaptive Synthetic Sampling” (ADASYN) [13]. However, it has been shown that classifier performance depends both on ad hoc parameter settings (e.g., the percentage of data to be under- or over-sampled in a class and the scale of the local neighborhood, to mention a few) and on the choice of the classifier itself. Experiments in [8, 14] suggested that well-known resampling techniques do not improve the performance of Multi-Layer Perceptron (MLP) neural networks, even with optimized parameter settings. Another solution is ensemble learning, which has shown improvements for MLPs [15, 16]. Ensemble extensions for imbalanced learning modify the pattern probability function during the training phase. Such a change affects the model selection criterion, since the lowest overall error rate, as in AdaBoost [17], gives way to a decision balanced among the per-class accuracy rates, taking the respective priors into account. Since the ensemble approach is based on a combination of different hypotheses (e.g., MLP neural networks), it usually leads to longer training times, especially when MLPs are used as weak learners.

This paper presents a novel approach to deal with the imbalanced data set problem in neural networks by incorporating prior probabilities into a cost-sensitive cross-entropy error function. The usual overall error formulation for MLPs is explicitly modified to incorporate unequal misclassification costs [18, 19]. Unlike other cost-sensitive approaches in MLP learning [8, 20], each class contribution in the cross-entropy error function is weighted by its respective class prior probability. This approach results in well-balanced decision boundaries.

The remainder of this paper is organized as follows. Section 2 describes the learning problem using the cross-entropy error function, and Sect. 3 presents the modified cost-sensitive cross-entropy error function by considering the prior probabilities of the classes. The methodology, experiments and results are shown and discussed in Sect. 4. Final considerations are given in Sect. 5.

2 The Learning Problem

In classification problems, considering the learning set \(\mathbf {S} = \{(\mathbf {x}_i,\mathbf {y}_i) \in \mathcal {X} \times \mathcal {Y} \, | \, i=1,\dots ,N\}\), the output labels \(\mathbf {y}_i\) given the inputs \(\mathbf {x}_i\) are generated by an unknown function \(f(\mathbf {x})\). The objective is to approximate it as closely as possible by means of a model \(f(\mathbf {x}\,|\,\theta )\), where \(\theta \) is the parameter set. Instead of adopting an empirical risk based on the Mean Squared Error (MSE) metric, \(\theta \) can be estimated through the cross-entropy error function (Eq. 1), where \(\hat{y}=f(\mathbf {x}\,|\,\theta )\) is the model output for the learned (\(\mathcal {X}\),\(\mathcal {Y}\))-mapping.

$$\begin{aligned} J(\theta )=\frac{1}{m}\sum _{i=1}^{m}\left[ -\,y_i\log (\hat{y}_i)-(1-y_i)\log (1-\hat{y}_i) \right] \end{aligned}$$
(1)

This function was chosen in place of the MSE since it is convex and more suitable for estimating posterior probabilities in the case of neural networks [21]. When \(y_i=0\), it reduces to \(-\log (1-\hat{y}_i)\), and when \({y}_i=1\), to \(-\log (\hat{y}_i)\). In either case, the error decreases logarithmically as \(\hat{y}_i\) tends to \({y}_i\) (Fig. 1). Moreover, since the curves are symmetrical, the error reduction happens at the same logarithmic rate for both classes (that is, \({y}_i = 0\) and \({y}_i = 1\)). For a balanced learning problem, the error from [\(-\log (\hat{y}_i)\)] will be proportional to that from [\(-\log (1-\hat{y}_i)\)], since each term will account for \(50\%\) of the total error \(J(\theta )\) for a given model output \(\hat{y}\). In the case of imbalanced data, however, the term associated with the majority class will have a larger influence on \(J(\theta )\). This occurs because the overall error, which is a sum of the individual terms, is minimized regardless of the class that generated each error.
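To illustrate this dominance numerically, the minimal sketch below (an illustrative Python/NumPy example, not part of the original paper) computes the two terms of Eq. 1 separately for a balanced and an imbalanced label vector, assuming a fixed, uninformative model output of 0.5 for every sample.

```python
import numpy as np

def cross_entropy_terms(y, y_hat, eps=1e-12):
    """Per-class contributions to the cross-entropy error J(theta) of Eq. 1."""
    pos = np.mean(-y * np.log(y_hat + eps))            # term for y_i = 1
    neg = np.mean(-(1 - y) * np.log(1 - y_hat + eps))  # term for y_i = 0
    return pos, neg

y_hat = 0.5  # hypothetical, uninformative model output for every sample

# Balanced: 200 samples per class -> each term accounts for ~50% of J(theta)
y_bal = np.array([1] * 200 + [0] * 200)
print(cross_entropy_terms(y_bal, y_hat))   # approx. (0.347, 0.347)

# Imbalanced: 5 positives vs. 200 negatives -> the majority term dominates
y_imb = np.array([1] * 5 + [0] * 200)
print(cross_entropy_terms(y_imb, y_hat))   # approx. (0.017, 0.676)
```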

Fig. 1
figure 1

Illustration of the cross-entropy error function \(J(\theta )\) for a range of model outputs \(\hat{y}\) given \({y_i=1}\) (solid line) and \({y_i=0}\) (dashed line)

Fig. 2
figure 2

A two-Gaussian problem

As an example, consider a binary classification problem with classes A and B, each one containing 200 samples (Fig. 2). An MLP neural network with two inputs, two hidden neurons and one output was then trained using the classical cross-entropy error function (Eq. 1). Additionally, consider imbalanced scenarios in which class A retains 5, 50 and 100 randomly chosen samples out of the original 200 instances. To evaluate the contribution of each class to the cross-entropy error function, the ratio R (Eq. 2) was calculated over 1000 iterations. Figure 3 depicts the obtained results.

$$\begin{aligned} R=\frac{-\mathbf {y}\log (\hat{\mathbf {y}})}{-(1-{\mathbf {y}}) \log (1-\hat{\mathbf {y}})} \end{aligned}$$
(2)

It can be observed that R is approximately constant when the prior probabilities of the classes are equal, since each class contributes equally to this ratio. The effect of the imbalance level can also be observed: the greater the imbalance, the higher the value at which R tends to stabilize. The discrepancy between the priors affects R mainly in the initial iterations. This behavior led to the cost-sensitive approach proposed in the next section.
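For reference, a minimal helper for computing the ratio R of Eq. 2 at a given training iteration is sketched below; the NumPy implementation and the function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def imbalance_ratio(y, y_hat, eps=1e-12):
    """Ratio R of Eq. 2: positive-class cross-entropy term divided by the
    negative-class term, evaluated for the current model outputs y_hat."""
    pos = -np.sum(y * np.log(y_hat + eps))
    neg = -np.sum((1 - y) * np.log(1 - y_hat + eps))
    return pos / (neg + eps)

# Hypothetical usage, once per training iteration, with outputs from the MLP:
# r = imbalance_ratio(y_train, mlp.predict_proba(x_train)[:, 1])
```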

Fig. 3
figure 3

Cross-entropy imbalance ratio R (Eq. 2) during learning of balanced and imbalanced data sets

3 Cost-Sensitive Cross-Entropy Error Function Approach

The discrepancy between the error rates obtained for balanced and imbalanced data may be treated by considering the optimal decision rule given in Eq. 3 [22]. For balanced problems, an approximately unit ratio between [\(-\,y_i\log (\hat{y}_i)\)] and [\(-\,(1-y_i)\log (1-\hat{y}_i)\)] is expected; however, as shown in Fig. 3, for imbalanced problems the ratio reflects the priors. It also tends to stabilize at one, since the contribution of the minority class becomes more influential as the number of iterations increases, compensating for the priors.

$$\begin{aligned} f_0(x) ={\left\{ \begin{array}{ll} 1,&{} \text {if }\frac{p(\mathbf {x}|\mathbf {y}=1)}{p(\mathbf {x}|\mathbf {y}=0)} \ge \frac{p(\mathbf {y}=0)}{p(\mathbf {y}=1)} \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

It is clear that the prior probabilities ratio \(\frac{p(\mathbf {y}=0)}{p(\mathbf {y}=1)}\) plays an important role in the classification balance between the classes, and it can therefore be used to compensate for the imbalance. One way to accomplish this is to incorporate this ratio into the cross-entropy error function, as shown in Eqs. 4 and 5, where N is the number of samples of the positive class (\(y=1\)), M is the total number of samples, and q, defined later in Eq. 9, weights the negative class term.

$$\begin{aligned}&\displaystyle J(\theta )=\frac{1}{m}\sum _{i=1}^{m}\left[ -\,y_i\log (\hat{y}_i)\,\lambda -(1-y_i)\log (1-\hat{y}_i)\,q\right] \end{aligned}$$
(4)
$$\begin{aligned}&\displaystyle \lambda = \left( \frac{N}{M}\right) ^{-1} \end{aligned}$$
(5)

As an example, considering \(\frac{N^{\text {Class A(1)}}}{N^{\text {Class B(0)}}} = 0.20\), Fig. 4 shows that the magnitude of [\(-\,y_i\log (\hat{y}_i)\,\lambda \)] decays faster than that of [\(-\,(1-y_i)\log (1-\hat{y}_i)\,q\)]. Gradient descent can also be applied to Eq. 4 in order to obtain \(\partial J(\theta )/\partial \theta ^{(n-1)}\) and \(\partial J(\theta )/\partial z^{(n-1)}\), as shown in Eqs. 6 to 12, where n is the last layer of neurons and \(\hat{y}_i = f(\mathbf {x}|\theta ,z)\). Compared with the conventional cross-entropy error function (Eq. 1), it can be noted that replacing only the error term of the output layer, that is, replacing \(\delta ^{(n)} = g(z^{(n)})-y\) with \( \delta ^{(n)} = \left( {q}g({z}^{(n)})-{\lambda } {y} \right) + {y}g({z}^{(n)})({\lambda } - {q})\), is enough to leave all other backpropagation equations unchanged. Applying this approach to the previous two-Gaussian example (Sect. 2), Fig. 5 shows that the cross-entropy error rate (dashed line) is kept almost constant when the priors are considered, which results in a balance between the classes. Gradient descent with the Rprop algorithm [23] was used in all the experiments of the next section. Although the present approach is pattern-based, nothing in the formulation prevents a matrix (batch) representation of the problem, such as the one presented in [24]. A minimal code sketch of the weighted error function and the modified output-layer term is given after Eq. 12.

$$\begin{aligned}&\displaystyle \frac{\partial J(\theta )}{\partial \theta ^{(n-1)}} = \left[ \left( {q}g({z}^{(n)})-{\lambda } {y} \right) + {y}g({z}^{(n)})({\lambda } - {q}) \right] {a}^{(n-1)} \end{aligned}$$
(6)
$$\begin{aligned}&\displaystyle \frac{\partial J(\theta )}{\partial z^{(n-1)}} = \delta ^{(n)}\theta ^{(n-1)}g(z^{(n-1)})(1-g(z^{(n-1)})) \end{aligned}$$
(7)
$$\begin{aligned}&\displaystyle \frac{\partial J(\theta )}{\partial z^{(n-1)}} = \delta ^{(n-1)} \end{aligned}$$
(8)
$$\begin{aligned}&\displaystyle q = \left( 1- ({N}/{M})\right) ^{-1} \end{aligned}$$
(9)
$$\begin{aligned}&\displaystyle {q}g({z}^{(n)})-{\lambda } {y} = {\gamma } \end{aligned}$$
(10)
$$\begin{aligned}&\displaystyle {y}g({z}^{(n)})({\lambda } - {q}) = \beta \end{aligned}$$
(11)
$$\begin{aligned}&\displaystyle \gamma + \beta = \delta \end{aligned}$$
(12)
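For concreteness, the following sketch (illustrative Python/NumPy code, not from the original work) implements the weighted error function and the modified output-layer term \(\delta ^{(n)}\) for a single sigmoid output, using \(\lambda \) from Eq. 5 and q from Eq. 9 as the class weights, consistently with Eq. 6; the function and variable names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_weights(y):
    """Lambda (Eq. 5) and q (Eq. 9) from the class counts of the training set."""
    M = len(y)                      # total number of samples
    N = np.sum(y == 1)              # number of positive (minority) samples
    lam = (N / M) ** -1             # weight of the positive-class term
    q = (1.0 - N / M) ** -1         # weight of the negative-class term
    return lam, q

def cost_sensitive_cross_entropy(y, y_hat, lam, q, eps=1e-12):
    """Cost-sensitive cross-entropy error of Eq. 4."""
    return np.mean(-lam * y * np.log(y_hat + eps)
                   - q * (1 - y) * np.log(1 - y_hat + eps))

def output_delta(z_n, y, lam, q):
    """Modified output-layer error delta^(n) (Eq. 6); with lam = q = 1 it
    reduces to the conventional g(z^(n)) - y of the unweighted case."""
    g = sigmoid(z_n)
    return (q * g - lam * y) + y * g * (lam - q)
```

The remaining backpropagation steps (Eqs. 7 and 8) are unchanged, so this delta can simply replace the standard output-layer error in an existing MLP implementation.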
Fig. 4
figure 4

Minority class output obtained with the proposed approach based on the cross-entropy error function \(J(\theta )\) for a range of model outputs \(h_{\theta }(\mathbf {x})\) given \(\mathbf {y_i}=1\) (solid line) and \(\mathbf {y_i=0}\) (dashed line)

Fig. 5
figure 5

Cross-entropy rate for MLP

4 Results and Discussion

4.1 Experimental Study

An empirical study was conducted using 16 data sets from the UCI repository, whose main characteristics are depicted in Table 1. The proposed cost-sensitive error function (CSEFMLP) was compared to five well-known techniques, namely SMOTE [11], weighted Wilson’s editing (WWE) [12], Rprop [23], SMOTE + Tomek Links (SMTTL) [25], and RAMOBoost [26]. After preprocessing as in [8], twenty different trials were carried out for each data set by shuffling its original indexes. Each trial was then split into a training subset (\(70\%\)), employed in a 7-fold cross-validation procedure for model selection, and a test subset, used for performance evaluation. The following metrics were employed: Kubat’s G-Mean, which balances the true positive and true negative rates through \(\sqrt{TPr \cdot TNr}\) [27]; the Area under the ROC curve (AUC), which also considers how well positive classes are ranked [28]; the Adjusted Geometric-Mean, a recent metric that balances specificity and sensitivity while favoring the latter [29]; the Accuracy, included to show that it can be a tricky metric when evaluated alone on unbalanced data problems; the True Positive Rate (TPR) and the True Negative Rate (TNR), which play an important role given the type of balance being pursued by the model; and the F1 score, the harmonic mean of precision and recall. The results for TPR and TNR were obtained from the model with the highest accuracy.
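As a reference, these metrics can be computed from a confusion matrix as sketched below; the use of scikit-learn and NumPy is an assumption (the paper does not specify any implementation), and the Adjusted G-Mean is omitted since its exact formulation is given in [29].

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score

def imbalance_metrics(y_true, y_pred, y_score):
    """G-Mean, AUC, accuracy, TPR, TNR and F1 for a binary test set.

    y_pred holds hard labels in {0, 1}; y_score holds the continuous
    model outputs used to compute the AUC."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)                    # sensitivity / recall
    tnr = tn / (tn + fp)                    # specificity
    return {
        "g_mean": np.sqrt(tpr * tnr),       # Kubat's G-Mean [27]
        "auc": roc_auc_score(y_true, y_score),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "tpr": tpr,
        "tnr": tnr,
        "f1": f1_score(y_true, y_pred),
    }
```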

Table 1 Characteristics of the data sets

4.2 Non-parametric Test

Model comparison procedures usually employ parametric tests; in this case, however, a non-parametric test is more adequate [30]. The comparison is based on the \(F_F\) statistic (Eq. 13), a correction of the Friedman statistic (\(\chi _F^2\)) (Eq. 14) [31]. This test allows a simultaneous comparison of multiple classifiers (L) over multiple data sets (M). The null hypothesis \(H_{0}\) states that all algorithms perform similarly, i.e., that they present equal average ranks (\(R_j\)), where \(R_j=\frac{1}{M}\sum _{i=1}^{M}r_i^j\), \(1\le j \le L\), is the average rank of the jth algorithm over all data sets.

$$\begin{aligned}&\displaystyle F_F = \frac{(M-1)\chi _F^2}{M(L-1)-\chi _F^2} \end{aligned}$$
(13)
$$\begin{aligned}&\displaystyle \chi _F^2 = \frac{12M}{L(L+1)}\left( \sum _{j=1}^{L}R_j^2 - \frac{L(L+1)^2}{4}\right) \end{aligned}$$
(14)

If the null hypothesis is rejected, another statistical test should be carried out to quantify the differences among the algorithms [30]. The most usual is the Bonferroni-Dunn post-hoc test [32]. Two classifiers are considered significantly different if the difference between their average ranks exceeds a critical difference (CD; Eq. 15), where \(q_{\alpha }\) is a critical value based on the Student statistic.

$$\begin{aligned} CD = q_\alpha \sqrt{\frac{L(L+1)}{6M}} \end{aligned}$$
(15)
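Eqs. 13–15 can be implemented directly from a matrix of per-data-set ranks, as in the minimal sketch below; the NumPy-based helpers and their names are illustrative assumptions, and the critical values (the F quantile and \(q_\alpha \)) must still be taken from the appropriate statistical tables.

```python
import numpy as np

def friedman_statistics(ranks):
    """chi^2_F (Eq. 14) and F_F (Eq. 13) from an (M data sets x L classifiers)
    matrix, where ranks[i, j] is the rank of classifier j on data set i."""
    M, L = ranks.shape
    R = ranks.mean(axis=0)                                   # average ranks R_j
    chi2_f = 12 * M / (L * (L + 1)) * (np.sum(R ** 2) - L * (L + 1) ** 2 / 4)
    f_f = (M - 1) * chi2_f / (M * (L - 1) - chi2_f)
    return chi2_f, f_f

def critical_difference(q_alpha, L, M):
    """Bonferroni-Dunn critical difference CD (Eq. 15); q_alpha is taken from
    the statistical table for the chosen significance level."""
    return q_alpha * np.sqrt(L * (L + 1) / (6 * M))
```

For instance, with M = 16 data sets and L = 6 classifiers, as in this study, the CD for a chosen significance level follows directly from the corresponding \(q_\alpha \).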

4.3 Results

Tables 2, 3, 4, 5, 6, 7 and 8 summarize the results for the considered metrics, namely G-Mean, AUC, Adjusted G-Mean, Accuracy, True Positive Rate (TPR), True Negative Rate (TNR) and F1 score, respectively. For each data set, the highest classification score is highlighted in bold. In general, the proposed approach performs well in comparison to the well-known classifiers. For G-Mean, Adjusted G-Mean and TPR, it is the classifier with the highest number of best ratings, and for AUC, this number is similar to that of Rprop. For Accuracy, its scores are generally close to the highest ones, mainly obtained by Rprop and RAMOBoost, even though this metric is not the focus in unbalanced data problems. For TNR, a lower rating is expected, since the proposed approach seeks a good balance between TPR and TNR, which led to a good average rank for TPR.

Table 2 Average values for G-Mean
Table 3 Average values for AUC
Table 4 Average values for Adjusted G-Mean
Table 5 Average values for accuracy
Table 6 Average values for True Positive Rate
Table 7 Average values for True Negative Rate
Table 8 Average values for F1-score

The average ranks (\(R_j\)) are depicted in the last rows of Tables 2, 3, 4, 5, 6, 7 and 8; the lower the value, the better. The proposed approach yields the lowest values for the G-Mean, AUC, Adjusted G-Mean and TPR metrics. Next, the \(F_F\) test (Eq. 13) was applied for an overall performance evaluation of the classifiers, with \(M = 16\) (number of data sets) and \(L = 6\) (number of classifiers). The test statistics \((F_F)\) for the G-Mean, AUC, Adjusted G-Mean, Accuracy, True Positive Rate, True Negative Rate and F1-score metrics are, respectively, 3.3206, 2.4818, 6.8324, 10.1638, 4.1453, 9.3699 and 1.2870. Given the critical value \(F_{F;5;75;\alpha = 0.01} = 1.9256\), the null hypothesis that all algorithms perform similarly was rejected for every metric except the F1-score. The Bonferroni-Dunn post-hoc test was then used to evaluate the proposed approach (CSEFMLP). Table 9 shows the pairwise differences for all classifiers, with values beyond the critical difference (\(CD = 1.5385\); \(\alpha = 0.1\)) highlighted in bold. According to the G-Mean metric, the CSEFMLP classifier is significantly better than Rprop and SMOTE and slightly better than SMTTL, RAMOBoost and WWE. Regarding the AUC metric, it outperforms the SMOTE and RAMOBoost classifiers and is slightly better than SMTTL and WWE. For the Adjusted G-Mean metric, it is better than Rprop and RAMOBoost and slightly better than WWE. For the Accuracy metric, CSEFMLP performs better than Rprop and RAMOBoost and slightly better than SMOTE. According to the True Positive Rate metric, CSEFMLP outperforms Rprop and SMOTE and is slightly better than RAMOBoost and SMTTL, and according to the True Negative Rate metric, it is significantly better than RAMOBoost and Rprop and slightly better than SMTTL. For the F1-score, CSEFMLP performs statistically on par with all classifiers. This set of results shows the efficiency and robustness of the proposed approach, based on the cost-sensitive cross-entropy error function, in handling unbalanced data problems.

Table 9 Bonferroni-Dunn post-hoc test

5 Conclusion

This work proposes a new approach, called CSEFMLP (Cost-Sensitive Cross-Entropy Error Function for MLP neural networks), to handle the common unbalanced classification problem. The method generally performs better than, or at least similarly to, well-known classifiers when considering a set of performance metrics for unbalanced problems, namely G-Mean, AUC, Adjusted G-Mean, Accuracy, True Positive Rate, True Negative Rate and F1-score. In short, the obtained results demonstrate that the proposed approach is able to deal with unbalanced data.