Abstract
The performance of traditional classification algorithms degrades when the data are highly imbalanced, and learning from such data remains a difficult problem. In this paper, the authors propose an ensemble tree classifier for highly imbalanced data classification. The ensemble classifier is built on a complete binary tree structure. A mathematical model is established from the features and classification performance of the classifier, and it is proven that the model parameters of the ensemble classifier can be obtained by calculation. First, the AdaBoost method is used as the benchmark classifier to construct the tree-structured model. Next, the classification cost of the model is computed, yielding a quantitative mathematical relationship between the cost and the features of the ensemble tree classifier. Finally, minimizing this cost is cast as an optimization problem, and the parameters of the ensemble tree classifier are derived theoretically. The approach is tested on several highly imbalanced datasets from different fields, with the AUC (area under the ROC curve) and the F-measure as evaluation criteria. Compared with traditional imbalanced classification algorithms, the ensemble tree classifier achieves better classification performance.
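The evaluation protocol described in the abstract can be illustrated with a minimal sketch: train the AdaBoost benchmark on a synthetic highly imbalanced binary problem and score it with the paper's two criteria, AUC and F-measure. The dataset, class ratio, and hyperparameters below are illustrative assumptions, not the authors' experimental setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic "highly imbalanced" binary problem: ~2% minority class.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# AdaBoost as the benchmark classifier.
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The two evaluation criteria used in the paper.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # area under ROC curve
f1 = f1_score(y_te, clf.predict(X_te))                    # F-measure (minority class)
print(f"AUC = {auc:.3f}, F-measure = {f1:.3f}")
```

AUC is preferred over plain accuracy here because a classifier that always predicts the majority class already attains ~98% accuracy on such data while being useless on the minority class.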
This work was supported by the National Natural Science Foundation of China under Grant No. 61976198, the Natural Science Research Key Project for Colleges and Universities of Anhui Province under Grant No. KJ2019A0726, and the High-level Scientific Research Foundation for the Introduction of Talent of Hefei Normal University under Grant No. 2020RCJJ44.
This paper was recommended for publication by Editor ZHANG Xinyu.
Shi, P., Wang, Z. An Ensemble Tree Classifier for Highly Imbalanced Data Classification. J Syst Sci Complex 34, 2250–2266 (2021). https://doi.org/10.1007/s11424-021-1038-8