Abstract
This paper presents a novel approach to deal with the imbalanced data set problem in neural networks by incorporating prior probabilities into a cost-sensitive cross-entropy error function. Several classical benchmarks were tested for performance evaluation using different metrics, namely G-Mean, area under the ROC curve (AUC), adjusted G-Mean, Accuracy, True Positive Rate, True Negative Rate and F1-score. The results were compared with those of well-known algorithms and demonstrate the effectiveness and robustness of the proposed approach, which yields well-balanced classifiers under different imbalance scenarios.
1 Introduction
The number of samples commonly differs from one class to another in classification problems. This problem, known as the imbalanced data set problem [1,2,3,4,5,6,7], arises in most real-world applications. The point is that most current inductive learning principles rely on a sum of squared errors that does not take priors into account, which generally biases classification towards the majority class.
One possible approach to handle this problem is to adopt an alternative criterion to the overall learning error [4, 8, 9]. Another solution is data resampling, which indirectly modifies the selection probability of the patterns during the learning phase. According to Bayesian decision theory, changing the prior probabilities is analogous to setting a new decision boundary for a probability-based classifier [10]. Many data resampling techniques have been proposed in the literature, for example, “Synthetic Minority Oversampling Technique” (SMOTE) [11], “Weighted Wilson’s Editing” (WWE) [12] and “Adaptive Synthetic Sampling” (ADASYN) [13]. However, it has been shown that classifier performance depends both on ad hoc parameter settings (e.g., the percentage of data to be under- or over-sampled in a class and the scale of the local neighborhood, to mention a few) and on the choice of the classifier itself. Experiments in [8, 14] suggested that well-known resampling techniques do not improve the performance of Multi-Layer Perceptron (MLP) neural networks, even with optimized parameter settings. Another solution is ensemble learning, which has shown improvements for MLPs [15, 16]. Ensemble extensions for imbalanced learning change the pattern probability function during the training phase. This change affects the model selection criterion: the lowest overall error rate, as in AdaBoost [17], gives way to a decision balanced among the per-class accuracy rates, taking the respective priors into account. Since the ensemble approach combines different hypotheses (e.g., MLP neural networks), it usually leads to longer training times, especially when MLPs are used as weak learners.
This paper presents a novel approach to deal with the imbalanced data set problem in neural networks by incorporating prior probabilities into a cost-sensitive cross-entropy error function. The usual overall error formulation for MLPs is explicitly modified to incorporate unequal misclassification costs [18, 19]. Unlike other cost-sensitive approaches to MLP learning [8, 20], each class's contribution to the cross-entropy error function is weighted by its respective prior probability. This approach results in well-balanced decision boundaries.
The remainder of this paper is organized as follows. Section 2 describes the learning problem using the cross-entropy error function, and Sect. 3 presents the modified cost-sensitive cross-entropy error function by considering the prior probabilities of the classes. The methodology, experiments and results are shown and discussed in Sect. 4. Final considerations are given in Sect. 5.
2 The Learning Problem
In classification problems, considering the learning set \(\mathbf {S} = \{(\mathbf {x}_i,\mathbf {y}_i) \in \mathcal {X} \times \mathcal {Y} \, | \, i=1,\dots ,N\}\), the output labels \(\mathbf {y}_i\) given the inputs \(\mathbf {x}_i\) are generated by an unknown function \(f(\mathbf {x})\). The objective is to estimate it as closely as possible by means of a model \(f(\mathbf {x}\,|\,\theta )\), where \(\theta \) is the parameter set. Instead of adopting an empirical risk, often based on the Mean Squared Error (MSE) metric, another way to estimate \(\theta \) is through the cross-entropy error function (Eq. 1), where \(\hat{y}=f(\mathbf {x}\,|\,\theta )\) is the model output, given the learning of the (\(\mathcal {X}\),\(\mathcal {Y}\))-mapping function.
This function was chosen in place of the MSE since it is convex and more suitable for estimating posterior probabilities with neural networks [21]. When \(y_i=0\), it reduces to \(-\log (1-\hat{y}_i)\); when \({y}_i=1\), it reduces to \(-\log (\hat{y}_i)\). In either case, the error decreases logarithmically as \(\hat{y}_i\) tends to \({y}_i\) (Fig. 1). Moreover, since the curves are symmetrical, the error decreases at the same logarithmic rate for both classes (that is, \({y}_i = 0\) and \({y}_i = 1\)). For a balanced learning problem, the error from \(-\log (\hat{y}_i)\) is proportional to that from \(-\log (1-\hat{y}_i)\), since each term accounts for \(50\%\) of the total error \(J(\theta )\) for a given model output \(\hat{y}\). In the case of imbalanced data, however, the term associated with the majority class has a larger influence on \(J(\theta )\), because the overall error, which is a sum of the individual terms, is minimized regardless of the class that generated each error.
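The cross-entropy error of Eq. 1 can be written down directly. Below is a minimal sketch for the binary case, assuming sigmoid outputs in \((0,1)\); the function name and the clipping constant are illustrative choices, not part of the original formulation.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy error J(theta) (Eq. 1).

    y     : array of true labels in {0, 1}
    y_hat : array of model outputs in (0, 1)
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    # Each sample contributes -log(y_hat) when y = 1, -log(1 - y_hat) when y = 0.
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```

Note the symmetry discussed above: a sample with \(y=1\) and \(\hat{y}=0.7\) incurs the same error as one with \(y=0\) and \(\hat{y}=0.3\).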
As an example, consider a binary classification problem with classes A and B, each containing 200 samples (Fig. 2). An MLP neural network with two inputs, two hidden neurons and one output was trained using the classical cross-entropy error function (Eq. 1). Also consider imbalanced scenarios in which class A keeps 5, 50 or 100 samples drawn at random from the original 200 instances. To evaluate the contribution of each class to the cross-entropy error function, the ratio R (Eq. 2) was computed over 1000 iterations. Figure 3 depicts the obtained results.
It can be observed that R is approximately constant when the prior probabilities of the classes are equal, since each class contributes equally to the ratio. The effect of the imbalance level can also be observed: the greater it is, the more R tends to stabilize at a higher value. The discrepancy between the priors affects R mainly in the initial iterations. This behavior motivated the cost-sensitive approach presented in the next section.
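Since Eq. 2 is not reproduced in this excerpt, the following is a hypothetical reconstruction of the ratio R as the positive-class error term of the cross-entropy divided by the negative-class term, which is consistent with the behavior described above (R near one for balanced priors, stabilizing higher under imbalance). The function name is illustrative.

```python
import numpy as np

def class_error_ratio(y, y_hat, eps=1e-12):
    # Hypothetical reconstruction of the ratio R (Eq. 2): contribution of the
    # positive class (y = 1) to the cross-entropy error divided by the
    # contribution of the negative class (y = 0).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    err_pos = -np.sum(y * np.log(y_hat))              # term from y = 1 samples
    err_neg = -np.sum((1.0 - y) * np.log(1.0 - y_hat))  # term from y = 0 samples
    return err_pos / err_neg
```

With equal priors and symmetric outputs the two terms match and R is one; shrinking one class shrinks its term and pushes R away from one, as in Fig. 3.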
3 Cost-Sensitive Cross-Entropy Error Function Approach
The discrepancy between the error rates on balanced and imbalanced data may be treated by considering the optimal decision rule given in Eq. 3 [22]. An approximately unit ratio between [\(-\,{y'}_i^{(j)}\log (\hat{y'}_i^{(j)})\)] and [\(-\,(1-{y'}_i^{(j)})\log (1-\hat{y'}_i^{(j)})\)] is expected for balanced problems; however, as shown in Fig. 3, the ratio reflects the priors. It also tends to stabilize at one, since the contribution of the minority class becomes more influential as the number of iterations increases, compensating for the priors.
It is clear that the prior probability ratio \(\frac{p(\mathbf {y}=0)}{p(\mathbf {y}=1)}\) plays an important role in the classification balance between classes, and it can therefore be used to compensate for the imbalance. One way to accomplish this is to incorporate this ratio into the cross-entropy error function, as shown in Eqs. 4 and 5, where N is the number of samples of the positive class (\(y=1\)) and M is the total number of samples.
As an example, considering that \(\frac{N^{\text {Class A(1)}}}{N^{\text {Class B(0)}}} = 0.20\), Fig. 4 shows that the magnitude of [\(-\,y_i\log (\hat{y}_i)\lambda \)] decays faster than that of [\(-\,(1-y_i)\log (1-\hat{y}_i)(1-\lambda )\)]. Gradient descent can also be applied to Eq. 4 in order to obtain \(\partial J(\theta )/\partial \theta ^{(n-1)}\) and \(\partial J(\theta )/\partial z^{(n-1)}\), as shown in Eqs. 6 to 12, where n is the last layer of neurons and \(\hat{y}_i = f(\mathbf {x}|\theta ,z)\). Compared with the conventional cross-entropy error function (Eq. 1), it suffices to change only the error of the output layer, that is, to replace \(\delta ^{(n)} = g(z^{(n)})-y\) by \( \delta ^{(n)} = \left( {q}g({z}^{(n)})-{\lambda } {y} \right) + {y}g({z}^{(n)})({\lambda } - {q})\); all other equations remain unchanged. Applying this approach to the previous two-Gaussian example (Sect. 2), Fig. 5 shows that the cross-entropy error rate (dashed line) is kept almost constant when the priors are considered, which results in a balance between the classes. Gradient descent with the Rprop algorithm [23] was used in all experiments in the next section. Although the present approach is pattern-based, there is no constraint in the formulation preventing a further matrix representation of the problem, such as the one presented in [24].
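A sketch of the weighted loss (Eq. 4) and the modified output-layer error \(\delta^{(n)}\) follows, assuming sigmoid output units and, since Eq. 5 is not reproduced here, assuming \(q = 1 - \lambda\) (the complement weight of the negative-class term); with that assumption the delta above simplifies to \(\lambda y(\hat{y}-1) + (1-\lambda)(1-y)\hat{y}\), the gradient of Eq. 4 with respect to the pre-activation. Function names are illustrative.

```python
import numpy as np

def weighted_cross_entropy(y, y_hat, lam, eps=1e-12):
    # Cost-sensitive cross-entropy (Eq. 4): the positive-class term is
    # weighted by lam = N / M and the negative-class term by (1 - lam).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(lam * y * np.log(y_hat)
                    + (1.0 - lam) * (1.0 - y) * np.log(1.0 - y_hat))

def output_delta(y, y_hat, lam):
    # Modified output-layer error: delta = (q*g(z) - lam*y) + y*g(z)*(lam - q),
    # with g(z) = y_hat and, by assumption, q = 1 - lam. For y = 1 this gives
    # lam*(y_hat - 1); for y = 0 it gives (1 - lam)*y_hat.
    q = 1.0 - lam
    return (q * y_hat - lam * y) + y * y_hat * (lam - q)
```

Only this delta changes relative to standard backpropagation with Eq. 1; the remaining layer-wise updates are untouched.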
4 Results and Discussion
4.1 Experimental Study
An empirical study was conducted using 16 data sets from the UCI repository. The proposed cost-sensitive error function (CSEFMLP) was compared with five well-known techniques, namely SMOTE [11], Weighted Wilson's Editing (WWE) [12], Rprop [23], SMOTE + Tomek Links (SMTTL) [25] and RAMOBoost [26]. Table 1 depicts the main characteristics of the data sets. After preprocessing as in [8], twenty trials were carried out for each data set by shuffling the original indexes. The data were then split into a training subset (\(70\%\)), employed in a 7-fold cross-validation procedure for model selection, and a test subset, used for performance evaluation. The following metrics were employed: Kubat's G-Mean, which balances the true positive and true negative rates as \(\sqrt{TPr \cdot TNr}\) [27]; the area under the ROC curve (AUC), which also considers how well positive instances are ranked [28]; the Adjusted Geometric Mean, a recent metric that balances specificity and sensitivity while favoring the latter [29]; Accuracy, included to show that it can be a misleading metric if evaluated alone in unbalanced data problems; the True Positive Rate (TPR) and True Negative Rate (TNR), which play an important role given the type of balance being pursued by the model; and the F1-score, the harmonic mean of precision and recall. The results for TPR and TNR were obtained from the model with the highest accuracy.
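The two metrics least likely to be familiar can be computed directly from the confusion-matrix counts. A minimal sketch (function names are illustrative; the Adjusted G-Mean of [29] is omitted here since its formula is not reproduced in this excerpt):

```python
import math

def gmean(tp, tn, fp, fn):
    # Kubat's G-Mean: geometric mean of the per-class accuracies,
    # sqrt(TPr * TNr). It collapses to 0 if either class is fully missed.
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)

def f1_score(tp, fp, fn):
    # F1-score: harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```

Unlike accuracy, `gmean` punishes a classifier that sacrifices the minority class: predicting everything as the majority class gives TNr = 1 but TPr = 0, hence G-Mean = 0.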
4.2 Non-parametric Test
Model comparison procedures usually employ parametric tests. In this case, however, a non-parametric test is more adequate [30], namely the Nemenyi post-hoc statistical test (\(F_F\)) (Eq. 13), which is derived from the Friedman statistic (\(\chi _F^2\)) (Eq. 14) [31]. This test allows a simultaneous comparison of multiple classifiers (L) over multiple data sets (M). The null hypothesis \(H_{0}\) states that all algorithms perform similarly, i.e., they present equal average ranks (\(R_j\)), where \(R_j=\frac{1}{M}\sum _{i=1}^{M}r_i^j\), \(1\le j \le L\), is the average rank of the jth algorithm over all data sets.
If the null hypothesis is rejected, another statistical test should be carried out to quantify the differences among the algorithms [30]. The most usual is the Bonferroni-Dunn post-hoc test [32]. Two classifiers are considered significantly different if the difference between their average ranks exceeds a critical difference (CD; Eq. 15), where \(q_{\alpha }\) is based on the Studentized range statistic.
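These statistics follow standard forms (Demšar [30]); a minimal sketch, with function names chosen for illustration:

```python
import math

def friedman_stats(avg_ranks, M):
    # Friedman statistic chi2_F (Eq. 14) and its F-distributed form F_F
    # (Eq. 13) for L algorithms compared over M data sets.
    L = len(avg_ranks)
    chi2 = (12.0 * M / (L * (L + 1))) * (
        sum(r * r for r in avg_ranks) - L * (L + 1) ** 2 / 4.0)
    f_f = (M - 1) * chi2 / (M * (L - 1) - chi2)
    return chi2, f_f

def critical_difference(q_alpha, L, M):
    # Bonferroni-Dunn critical difference (Eq. 15): average-rank gaps
    # beyond CD indicate a significant difference between two classifiers.
    return q_alpha * math.sqrt(L * (L + 1) / (6.0 * M))
```

With the paper's setting (L = 6, M = 16) and \(q_{0.1} \approx 2.326\), `critical_difference` reproduces the CD = 1.5385 reported in Sect. 4.3; when all average ranks are equal, `friedman_stats` returns zero, as expected under \(H_0\).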
4.3 Results
Tables 2, 3, 4, 5, 6, 7 and 8 summarize the results for the considered metrics, namely G-Mean, AUC, Adjusted G-Mean, Accuracy, True Positive Rate (TPR), True Negative Rate (TNR) and F1-score, respectively. Given a data set, the highest classification score is highlighted in bold. In general, the proposed approach performs well in comparison with the well-known classifiers. For G-Mean, Adjusted G-Mean and TPR, it presents the highest number of best ratings, and for AUC this number is similar to Rprop's. For Accuracy, its scores are generally close to the highest ones, mainly obtained by Rprop and RAMOBoost, even though this metric is not the focus in unbalanced data problems. For TNR, a lower rating is expected, since the proposed approach seeks a good balance between TPR and TNR, which led to a good average rank for TPR.
The average ranks (\(R_j\)) are depicted in the last rows of Tables 2, 3, 4, 5, 6, 7 and 8; the lower the value, the better. The proposed approach yields the lowest values for the G-Mean, AUC, Adjusted G-Mean and TPR metrics. Next, the Nemenyi post-hoc statistical test was performed for an overall performance evaluation of the classifiers, with \(M = 16\) (number of data sets) and \(L = 6\) (number of classifiers). The test statistics \((F_F)\) for the G-Mean, AUC, Adjusted G-Mean, Accuracy, True Positive Rate, True Negative Rate and F1-score metrics are, respectively, equal to 3.3206, 2.4818, 6.8324, 10.1638, 4.1453, 9.3699 and 1.2870. Given the critical value \(F_{F;5;75;\alpha = 0.01} = 1.9256\), the null hypothesis that all algorithms perform similarly was rejected for every metric except the F1-score. The Bonferroni-Dunn post-hoc test was then used to evaluate the proposed approach (CSEFMLP). Table 9 shows the pairwise differences for all classifiers. Values beyond the critical difference (\(CD = 1.5385\); \(\alpha = 0.1\)) are highlighted in bold. According to the G-Mean metric, the CSEFMLP classifier is significantly better than Rprop and SMOTE and slightly better than SMTTL, RAMOBoost and WWE. Regarding the AUC metric, it outperforms the SMOTE and RAMOBoost classifiers and is slightly better than SMTTL and WWE. For the Adjusted G-Mean metric, it is better than Rprop and RAMOBoost and slightly better than WWE. For the Accuracy metric, CSEFMLP performs better than Rprop and RAMOBoost and slightly better than SMOTE. According to the True Positive Rate metric, CSEFMLP outperforms Rprop and SMOTE and is slightly better than RAMOBoost and SMTTL; for the True Negative Rate metric, it is significantly better than RAMOBoost and Rprop and slightly better than SMTTL. For the F1-score, CSEFMLP performs statistically equal to all classifiers.
This set of results shows the efficiency and robustness of the proposed approach, which is based on the cost-sensitive cross-entropy error function, to handle unbalanced data problems.
5 Conclusion
This work proposes a new approach, called CSEFMLP (Cost-Sensitive Cross-Entropy Error Function for MLP neural networks), to handle the common unbalanced classification problem. This method generally performs better than, or at least similarly to, well-known classifiers considering a set of performance metrics for unbalanced problems, namely G-Mean, AUC, Adjusted G-Mean, Accuracy, True Positive Rate, True Negative Rate and F1-score. In short, the obtained results demonstrate that the proposed approach is able to deal with unbalanced data.
References
Chawla NV, Japkowicz N, Kotcz A (2004a) Special issue on learning from imbalanced data sets. SIGKDD Explor 6(1):1–6
Chawla N, Japkowicz N, Kolcz A (2004b) Special issue on learning from imbalanced data sets. In: Editorial of the ACM SIGKDD explorations newsletter
He H, Garcia E (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141
Bhowan U, Johnston M, Zhang M, Yao X (2013) Evolving diverse ensembles using genetic programming for classification with unbalanced data. IEEE Trans Evol Comput 17(3):368–386
Frasca M, Bertoni A, Re M, Valentini G (2013) A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Netw 43:84–98
Wang L, Yang B, Chen Y, Zhang X, Orchard J (2017) Improving neural-network classifiers using nearest neighbor partitioning. IEEE Trans Neural Netw Learn Syst 28(10):2255–2267
Castro CL, Braga AP (2013) Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data. IEEE Trans Neural Netw Learn Syst 24(6):888–899
Oh SH (2011) A statistical perspective of neural networks for imbalanced data problems. Int J Contents 7(3):1–5
Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, New York
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Barandela R, Valdovinos RM, Sánchez JS, Ferri FJ (2004) The imbalanced training sample problem: under or over sampling? In: Structural, syntactic, and statistical pattern recognition. Springer, pp 806–814
He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE, pp 1322–1328
Khoshgoftaar TM, Van Hulse J, Napolitano A (2010) Supervised neural network modeling: an empirical investigation into learning from imbalanced data with labeling errors. IEEE Trans Neural Netw 21(5):813–830
Chen S, He H, Garcia EA (2010) RAMOBoost: ranked minority oversampling in boosting. IEEE Trans Neural Netw 21(10):1624–1642
Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336
Kukar M, Kononenko I (1998) Cost-sensitive learning with neural networks. In: ECAI, pp 445–449
Elkan C (2001) The foundations of cost-sensitive learning. In: International joint conference on artificial intelligence. Lawrence Erlbaum Associates Ltd, pp 973–978
Alejo R, García V, Sotoca JM, Mollineda RA, Sánchez JS (2007) Improving the performance of the RBF neural networks trained with imbalanced samples. In: Computational and ambient intelligence. Springer, pp 162–169
Kline DM, Berardi VL (2005) Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Comput Appl 14(4):310–318
Berger JO (2010) Statistical decision theory and Bayesian analysis, 2nd edn. Springer, New York
Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: IEEE international conference on neural networks. IEEE, pp 586–591
Zhu C, Wang Z (2017) Entropy-based matrix learning machine for imbalanced data sets. Pattern Recognit Lett 88:72–80
Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6:769–772
Provost F, Fawcett T (2001) Robust classification for imprecise environments. Mach Learn 42(3):203–231
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: ICML, Nashville, USA, vol 97, pp 179–186
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874
Batuwita R, Palade V (2012) Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning. J Bioinform Comput Biol 10(04):1250003
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1–30
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701
Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56(293):52–64
Acknowledgements
The authors would like to thank the funding agencies CNPq, FAPEMIG and CAPES for their financial support.
Aurelio, Y.S., de Almeida, G.M., de Castro, C.L. et al. Learning from Imbalanced Data Sets with Weighted Cross-Entropy Function. Neural Process Lett 50, 1937–1949 (2019). https://doi.org/10.1007/s11063-018-09977-1