Abstract
Imbalanced data classification remains a focus of intense research due to its ever-growing presence in real-life decision tasks. In this article, we focus on a classifier ensemble for imbalanced data classification. The ensemble is formed on the basis of individual classifiers trained on feature subsets selected in a supervised manner. Several methods employ this concept to ensure a highly diverse ensemble; nevertheless, most of them, such as Random Subspace or Random Forest, select attributes for a particular classifier randomly. The main drawback of these methods is that they do not provide the ability to supervise and control this task. In this work, we apply a genetic algorithm to the considered problem. The proposition formulates an original learning criterion which takes into consideration not only the overall classification performance but also ensures that the trained ensemble is characterised by high diversity. The experimental study confirmed the high efficiency of the proposed algorithm and its superiority over other ensemble forming methods based on random feature selection.
1 Introduction
Most classification data sets do not have exactly the same number of instances for each class. A serious problem appears when the data is significantly imbalanced, i.e., when one of the classes quantitatively dominates the others. Data imbalance may have a negative effect on the efficiency of standard classification algorithms, thus this problem requires a proper approach and the use of dedicated techniques [10].
Data pre-processing is one of the solutions to this problem. Such methods prepare the data in advance by equalising the number of instances in particular classes. There are three main approaches: creating additional objects in the minority class (oversampling), removing objects from the majority class (undersampling), or combining both techniques.
Another approach focuses on designing algorithms dedicated to imbalanced data analysis. Appropriate procedures and modifications change the operation of standard classifiers. The introduced changes aim to improve the obtained classification quality by increasing the weight of minority data in the imbalanced set. One such modification is feature selection. Lee et al. [12] presented a classifier with feature selection based on fuzzy entropy for pattern classification. The pattern space is partitioned into non-overlapping fuzzy decision regions, which allows obtaining smooth boundaries and achieving better classification performance. Koziarski et al. [9] came up with the idea of a feature-selection-based ensemble learning method, which creates subspaces in a guided, incremental manner according to parameter settings. One of the parameters decides about the preference toward feature quality or diversity. Canuto and Nascimento [4] proposed a genetic-based feature selection method to create classifier ensembles. They use a hybrid and adaptive fitness function consisting of a feature correlation measure and the ensemble accuracy. Du et al. [6] devised a feature-selection-based method for imbalanced data with a genetic algorithm. It improves the quality of a single model by using the geometric mean score as the fitness function. Haque et al. [8] proposed a genetic algorithm method for learning heterogeneous ensembles on imbalanced data. The idea is based on searching possible combinations of different types of classifiers in the ensemble and selecting the best compositions using a genetic algorithm with the Matthews correlation coefficient as the fitness function.
In this work, we present the Genetic Ensemble Selection (ges) classifier, a novel ensemble learning method which trains diverse individuals on the basis of feature subsets selected in a supervised manner using a genetic algorithm. The base genetic approach is extended with an original regularisation component in the learning criterion, aiming to ensure a high diversity of the ensemble and to protect it against the hazard of overfitting. The experiments conducted on a collection of benchmark datasets confirm the validity and good performance of the proposed method. The solution refers to the method proposed by Ksieniewicz and Woźniak [11], who applied random search to find the best ensemble in the pool using the balanced accuracy score [3]. The present work expands on this idea by implementing a genetic algorithm to search for optimal feature subspaces and modifies the proposed regularisation criterion for alternative solutions.
In a nutshell, the main contributions of this work are as follows:
- Proposition of a genetic-based classifier ensemble forming algorithm employing feature selection.
- Proposition of an original fitness function consisting of a classification performance metric with a regularisation component.
- Evaluation of the proposed method on a wide range of benchmark datasets.
2 Proposed Method
The aim of the classification algorithm is to assign an observed object to one of a predefined set of labels \(\mathcal {M}=\{1,2,...,M\}\). The decision is made on the basis of an analysis of selected attributes, which are gathered in the feature vector x.
Let us define a pool of individual classifiers as
\(\varPi = \{\varPsi _{1}, \varPsi _{2}, \ldots , \varPsi _{K}\},\)
where \(\varPsi _{k}\) denotes the k-th elementary classifier.
In this work the individuals use subsets of attributes to ensure the high diversity of the ensemble [14]; therefore, let us propose the following representation of \(\varPi \) as a bit word
\(\varPi = (b_1^1, b_1^2, \ldots , b_1^d, b_2^1, \ldots , b_K^d),\)
where \(b_i^j\) denotes whether the jth feature (out of d) is used by the ith classifier.
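As an illustration of this encoding, the following minimal sketch generates such a bit word and decodes it into one feature mask per ensemble member. The function names and the use of a flat bit list are illustrative assumptions, not the paper's implementation:

```python
import random

def random_chromosome(n_classifiers, n_features, rng=None):
    """Generate a random bit word: one feature mask per ensemble member."""
    rng = rng or random.Random(0)
    return [rng.randint(0, 1) for _ in range(n_classifiers * n_features)]

def decode(chromosome, n_classifiers, n_features):
    """Split the flat bit word into per-classifier feature masks."""
    return [chromosome[i * n_features:(i + 1) * n_features]
            for i in range(n_classifiers)]

chromosome = random_chromosome(3, 5)
masks = decode(chromosome, 3, 5)
assert len(masks) == 3 and all(len(m) == 5 for m in masks)
```

Each mask then restricts the corresponding base classifier to the features whose bit equals one.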
As the optimisation criterion we propose
\(Q(\varPi ) = metric(\varPi ) - \alpha \cdot no\text{-}features(\varPi ) + \beta \cdot av\text{-}Hamming(\varPi ),\)
where \(metric(\varPi )\) denotes the value of the chosen performance metric, \(no\text{-}features(\varPi )\) is the number of features used by all classifiers from \(\varPi \), and \(av\text{-}Hamming(\varPi )\) stands for the average Hamming distance between the bit words representing the individuals in \(\varPi \). The procedure for computing the criterion is shown in Algorithm 1.
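One plausible reading of this criterion, sketched below under the assumption that the feature-count term is subtracted as a penalty and the average Hamming distance is added as a diversity reward (the exact weighting is defined in the paper's Algorithm 1), is:

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two bit words differ."""
    return sum(x != y for x, y in zip(a, b))

def criterion(metric_value, masks, alpha, beta):
    """Hedged sketch of the fitness Q: performance metric, minus an
    alpha-weighted feature-count penalty, plus a beta-weighted
    average-Hamming-distance diversity reward."""
    no_features = sum(sum(m) for m in masks)
    pairs = list(combinations(masks, 2))
    av_hamming = sum(hamming(a, b) for a, b in pairs) / len(pairs)
    return metric_value - alpha * no_features + beta * av_hamming

# With both weights set to zero the criterion reduces to the metric alone.
q = criterion(0.9, [[1, 0, 1], [0, 1, 1], [1, 1, 0]], alpha=0.0, beta=0.0)
assert abs(q - 0.9) < 1e-12
```

Setting \(\alpha \) and \(\beta \) to zero recovers the unregularised ges variant discussed in the experiments.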
We employ the genetic algorithm to find the optimal ensemble line-up. The pseudocode of the ges (Genetic Ensemble Selection) procedure is presented in Algorithm 2. Let us briefly describe its main steps.
Initialisation Procedure. Algorithm 2 starts with a randomly generated population of ensembles \(Population=\{\varPi _1, \varPi _2, ...., \varPi _S\}\). Its size is arbitrarily chosen via an input parameter. In general, the larger the population, the more comprehensive the optimisation.
Evaluate Population Procedure. Individuals in the population are evaluated using criterion 4, calculated by the algorithm (Eq. 1) on samples stored in the training set. The obtained results are used to select the elite (the best-evaluated ensemble), which is immune to mutation and replacement by crossover.
Genetic Operators. The mutation injects randomness into chromosomes. We used a simple bit mutation (flipping random bits in a binary word). The crossover operator exchanges random member classifiers between two randomly chosen parent individuals to form child chromosomes, using the standard two-point crossover operator [2].
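The two operators can be sketched as follows; this is a generic bit-flip mutation and two-point crossover on flat bit words, with illustrative function names rather than the paper's code:

```python
import random

def mutate(chromosome, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [b ^ 1 if rng.random() < p_mut else b for b in chromosome]

def two_point_crossover(parent_a, parent_b, rng):
    """Exchange the segment between two random cut points."""
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

rng = random.Random(42)
a, b = [0] * 8, [1] * 8
c, d = two_point_crossover(a, b, rng)
assert sum(c) + sum(d) == 8  # bits are exchanged, never created or lost
```

Because crossover only swaps segments, the total number of selected features across both children is preserved; diversity changes come from mutation and selection.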
Selection and Reproduction. To maintain the diversity of the population, which is essential for effectively exploring the space of possible solutions, the selection process has to be realised in a probabilistic manner, i.e., not only are the best chromosomes chosen, but the probability of selection is directly proportional to their evaluation according to criterion 4. To meet this assumption, a standard ranking selection procedure was implemented, in which \(S-1\) individuals are generated. The generated individuals and the elite individual are joined together to form the offspring population.
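A minimal sketch of rank-proportional selection with elitism, under the assumption that selection weight equals rank position (names and weighting scheme are illustrative; the paper only specifies a standard ranking selection):

```python
import random

def ranking_selection(population, fitnesses, rng):
    """Rank-proportional selection with elitism: the best individual
    survives unchanged, while the remaining S-1 slots are drawn with
    probability proportional to rank (1 for the worst, S for the best)."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    weights = [0] * len(population)
    for rank, idx in enumerate(order, start=1):
        weights[idx] = rank
    elite = population[order[-1]]
    drawn = rng.choices(population, weights=weights, k=len(population) - 1)
    return [elite] + drawn

pop = ["worst", "mid", "best"]
fit = [0.1, 0.5, 0.9]
offspring = ranking_selection(pop, fit, random.Random(0))
assert offspring[0] == "best" and len(offspring) == 3
```

Ranking (rather than raw-fitness) weighting keeps selection pressure stable even when fitness values are close together, which is common late in a run.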
3 Experimental Evaluation
Conducting appropriate experiments and analysis of the results should allow us to verify the idea and check the quality of the classification for the proposed method. This section describes the set-up of the conducted experiments, presents their results and analysis. During the experiments we would like to verify the following research hypotheses:
- Does the proposed approach improve the classification quality?
- Do the adjusted regularisation methods have a positive effect on the operation of the classifier?
- Is the random subspace an adequate solution for feature selection in this classifier proposal?
3.1 Set-Up
All experiments were carried out using benchmark binary-class imbalanced data sets, publicly available in the KEEL repository [1]. The selected data sets have different imbalance ratios (from balanced problems to a 1:33 proportion) and various numbers of features. The tests were carried out on the basis of 5-fold stratified cross-validation, which allows for a reliable evaluation of the proposed method. The Friedman ranking test [5] was used to assess the ranks of the classifiers over all examined benchmarks. It checks whether the assigned ranks differ significantly from assigning each classifier an average rank. If the hypothesis of ranking equality is rejected, the Shaffer post hoc test [7] is used to find out which of the tested methods are distinctive in an n \(\times \) n comparison. The post hoc procedure was based on the significance level \(\alpha =0.05\).
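Stratified cross-validation keeps the class ratio of the full data set in every fold, which matters for imbalanced data, where a plain random split could leave a fold with no minority samples. A dependency-free sketch of fold assignment (scikit-learn's StratifiedKFold, which the authors presumably used, handles shuffling and edge cases beyond this illustration):

```python
def stratified_kfold_indices(y, k=5):
    """Assign sample indices to k folds so that each fold preserves
    the class proportions of y (round-robin within each class)."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# A 2:1 imbalanced toy labelling: every fold gets 2 majority + 1 minority.
y = [0] * 10 + [1] * 5
folds = stratified_kfold_indices(y, k=5)
assert all(len(fold) == 3 for fold in folds)
```

Each fold in turn serves as the test set while the remaining folds form the training set.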
The implementation follows the guidelines of the scikit-learn library API [13]. Logistic Regression and Gaussian Naive Bayes were used as the base classifiers, and the tests were carried out as a comparative analysis of the different approaches to building ges classifier ensembles on top of them:
- ges—the base Genetic Ensemble Selection classifier without regularisation,
- ges-a—Genetic Ensemble Selection with alpha-regularisation,
- ges-b—Genetic Ensemble Selection with both alpha- and beta-regularisation.
Each comparison was supplemented with a bare base method trained on the whole set of features and a Random Subspace (rs) approach. All 10 approaches have been tested and evaluated using three metrics: F-score, geometric mean score and balanced accuracy score [3], where the metric used for evaluation is always also used as part of the optimisation criterion Q. The entire implementation and the code allowing to repeat all the experiments has been placed in a publicly available repository.
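Both balanced accuracy and the geometric mean score can be expressed in terms of per-class recall, which is what makes them suitable for imbalanced data. A dependency-free sketch (the paper itself relies on library implementations of these metrics):

```python
from math import prod

def per_class_recall(y_true, y_pred):
    """Recall (true-positive rate) computed separately for each class."""
    recalls = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        support = sum(1 for t in y_true if t == c)
        recalls.append(tp / support)
    return recalls

def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recall."""
    r = per_class_recall(y_true, y_pred)
    return sum(r) / len(r)

def geometric_mean_score(y_true, y_pred):
    """Geometric mean of per-class recall: zero if any class is missed."""
    r = per_class_recall(y_true, y_pred)
    return prod(r) ** (1 / len(r))

# Majority class recall 3/4, minority recall 1/2: balanced accuracy 0.625.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
assert abs(balanced_accuracy(y_true, y_pred) - 0.625) < 1e-12
```

Unlike plain accuracy, both metrics weight each class equally regardless of its support, so a classifier ignoring the minority class scores poorly.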
3.2 Results
The first experiment aimed to identify the best hyperparameters for the genetic algorithm implemented within ges. The parameters tested were the crossover probability and the mutation probability.
The tested values ranged from 0% to 10% for crossover and from 0% to 2% for mutation. Figure 1 shows the learning curves obtained for two exemplary datasets. The intensity of the green colour indicates the average value of the learning curve during the run. The height of the dot on the right and the score above each square give the quality of the trained model on the test set; the intensity of the red colour of this dot indicates whether it is among the worst or the best results. After analysing all the results, the optimal values were selected as 2.5% for crossover and 1% for mutation. These hyperparameter settings were used in all further experiments.
The next experiment aimed to find the best regularisation parameters \(\alpha \) and \(\beta \). Both were tested in the range from 0 to 1 with a step of 0.125. Figure 2 shows the results obtained on exemplary datasets. The intensity of the yellow colour indicates the average value of the learning curve during the run; the height of the dot on the right and the score above each square give the quality of the trained model on the test set, and the intensity of the red colour of this dot indicates whether it is among the worst or the best results. The analysis of the obtained results allowed us to select the best parameter values.
The main experiment was aimed at conducting research on the data sets after optimising the parameters of the proposed method. Tables 1, 2 and 3 show the obtained results. Each table has been divided into two parts, one for each employed base classifier. The outcome of the ranking tests is presented in Table 4 for Logistic Regression and in Table 5 for Gaussian Naive Bayes as the base classifier.
3.3 Analysis
The scores obtained show that the proposed ges method has a certain potential. The ranking tests indicate that ges and its variants perform much better than the random subspace or base classifier methods. ges is able to choose the most favourable ensemble of subspaced classifiers, which gives it a significant advantage over rs. On the other hand, learning with this method is a computationally complex process and it cannot be applied to problems where the classifier's learning speed is an important factor. In response to the experimental hypotheses:
- The results show that feature selection with the genetic algorithm approach can improve imbalanced data classification quality.
- The regularisation methods have no significant impact on the classifier's operation. This is indicated by the good performance of, and lack of statistical difference between, base ges and its variants ges-a and ges-b, where regularisation is reduced to one component or excluded entirely.
- Classification based only on Random Subspace gives rather poor performance. Combining subspace-based feature selection with the optimised (genetic) selection yields a significant improvement compared to classification on the full feature space.
4 Conclusions and Future Directions
The presented work focuses on an innovative ensemble method (ges) dedicated to imbalanced data classification. It is based on a genetic algorithm aiming to optimise the selection of feature combinations. The conducted experimental evaluation shows the superiority of ges over both the basic random selection of features and the base classifiers trained on the full feature space.
An important aspect of the analysis was the inclusion of a regularisation component in the fitness function, aiming to ensure the diversity of the classifier committee. Experiments have shown that introducing it does not give a statistically significant improvement in results, so further development of this method should consider a different regularisation scheme, one with a greater impact on the classifier's operation and better classification performance. The proposed method can be considered a useful and effective tool in the classification of data with varying degrees of imbalance.
References
Alcalá-Fdez, J., et al.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Mult. Valued Log. Soft Comput. 17, 255–287 (2011)
Back, T., Fogel, D., Michalewicz, Z.: Handbook of Evolutionary Computation. Oxford University Press, New York (1997)
Branco, P., Torgo, L., Ribeiro, R.P.: Relevance-based evaluation metrics for multi-class imbalanced domains. In: Kim, J., Shim, K., Cao, L., Lee, J.-G., Lin, X., Moon, Y.-S. (eds.) PAKDD 2017. LNCS (LNAI), vol. 10234, pp. 698–710. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57454-7_54
Canuto, A.M., Nascimento, D.S.: A genetic-based approach to features selection for ensembles using a hybrid and adaptive fitness function. In: The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2012)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Du, L., Xu, Y., Jin, L.: Feature selection for imbalanced datasets based on improved genetic algorithm. In: Decision Making and Soft Computing: Proceedings of the 11th International FLINS Conference, pp. 119–124. World Scientific (2014)
García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf. Sci. 180(10), 2044–2064 (2010)
Haque, M.N., Noman, N., Berretta, R., Moscato, P.: Heterogeneous ensemble combination search using genetic algorithm for class imbalanced data classification. PloS One 11(1), e0146116 (2016)
Koziarski, M., Krawczyk, B., Woźniak, M.: The deterministic subspace method for constructing classifier ensembles. Pattern Anal. Appl. 20(4), 981–990 (2017)
Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Prog. Artif. Intell. 5(4), 221–232 (2016). https://doi.org/10.1007/s13748-016-0094-0
Ksieniewicz, P., Woźniak, M.: Imbalanced data classification based on feature selection techniques. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A.J. (eds.) IDEAL 2018. LNCS, vol. 11315, pp. 296–303. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03496-2_33
Lee, H.M., Chen, C.M., Chen, J.M., Jou, Y.L.: An efficient fuzzy classifier with feature selection based on fuzzy entropy. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 31(3), 426–432 (2001)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Wozniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17 (2014)
Acknowledgments
This work was supported by the Polish National Science Centre under the grant No. 2017/27/B/ST6/01325 as well as by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wroclaw University of Science and Technology.
© 2019 Springer Nature Switzerland AG
Klikowski, J., Ksieniewicz, P., Woźniak, M. (2019). A Genetic-Based Ensemble Learning Applied to Imbalanced Data Classification. In: Yin, H., Camacho, D., Tino, P., Tallón-Ballesteros, A., Menezes, R., Allmendinger, R. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2019. IDEAL 2019. Lecture Notes in Computer Science(), vol 11872. Springer, Cham. https://doi.org/10.1007/978-3-030-33617-2_35
Print ISBN: 978-3-030-33616-5
Online ISBN: 978-3-030-33617-2