Abstract
The performance of traditional classification algorithms degrades when the data are highly imbalanced, and learning from such data remains a difficult problem. In this paper, the authors propose an ensemble tree classifier for highly imbalanced data classification. The ensemble classifier is built on a complete binary tree structure. A mathematical model is established from the features and classification performance of the classifier, and it is proven that the model parameters of the ensemble classifier can be obtained by calculation. First, the AdaBoost method is used as the benchmark classifier to construct the tree-structured model. Next, the classification cost of the model is computed, yielding a quantitative mathematical relationship between the cost and the features of the ensemble tree classifier. Finally, minimizing this cost is cast as an optimization problem, and the parameters of the ensemble tree classifier are derived theoretically. The approach is tested on several highly imbalanced datasets from different fields, with the AUC (area under the ROC curve) and the F-measure as evaluation criteria. Compared with traditional imbalanced classification algorithms, the ensemble tree classifier achieves better classification performance.
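The evaluation protocol described in the abstract can be illustrated with a minimal sketch: train the AdaBoost benchmark on a synthetic highly imbalanced binary problem and score it with the paper's two criteria, AUC and F-measure. The dataset, class ratio, and hyperparameters below are illustrative assumptions, not the authors' experimental setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic "highly imbalanced" binary problem: ~2% minority class.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# AdaBoost as the benchmark classifier.
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The two evaluation criteria used in the paper.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # area under ROC curve
f1 = f1_score(y_te, clf.predict(X_te))                    # F-measure (minority class)
print(f"AUC = {auc:.3f}, F-measure = {f1:.3f}")
```

AUC is preferred over plain accuracy here because a classifier that always predicts the majority class already attains ~98% accuracy on such data while being useless on the minority class.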
This work was supported by the National Natural Science Foundation of China under Grant No. 61976198, the Natural Science Research Key Project for Colleges and Universities of Anhui Province under Grant No. KJ2019A0726, and the High-level Scientific Research Foundation for the Introduction of Talent of Hefei Normal University under Grant No. 2020RCJJ44.
This paper was recommended for publication by Editor ZHANG Xinyu.
Shi, P., Wang, Z. An Ensemble Tree Classifier for Highly Imbalanced Data Classification. J Syst Sci Complex 34, 2250–2266 (2021). https://doi.org/10.1007/s11424-021-1038-8