
1 Introduction

Most classification data sets do not have exactly the same number of instances in each class. A serious problem appears when the data is significantly imbalanced, i.e., when one of the classes quantitatively dominates over the other ones. Data imbalance may have a negative effect on the efficiency of standard classification algorithms, thus this problem requires a proper approach and the use of dedicated techniques [10].

Data pre-processing is one of the solutions to this problem. Such methods prepare the data beforehand by equalising the number of instances in particular classes. There are three main approaches: creating additional objects in the minority class (oversampling), removing objects from the majority class (undersampling), or combining both techniques.

Another approach focuses on designing algorithms dedicated to imbalanced data analysis. Appropriate procedures and modifications change the operation of standard classifiers. The introduced changes aim to improve the obtained classification quality by increasing the weight of minority data in the imbalanced set. One such modification is feature selection. Lee et al. [12] presented a classifier with feature selection based on fuzzy entropy for pattern classification. The pattern space is partitioned into non-overlapping fuzzy decision regions, which allows obtaining smooth boundaries and achieving better classification performance. Koziarski et al. [9] proposed a feature-selection-based ensemble learning method which creates subspaces in a guided, incremental manner according to parameter settings; one of the parameters decides about the preference toward feature quality or diversity. Canuto and Nascimento [4] proposed a genetic-based feature selection method to create ensemble classifiers. They use a hybrid and adaptive fitness function consisting of a feature correlation measure and the ensemble accuracy. Du et al. [6] introduced a feature-selection-based method for imbalanced data with a genetic algorithm, which improves the quality of a single model by using the geometric mean score as the fitness function. Haque et al. [8] proposed a genetic algorithm method for learning heterogeneous ensembles on imbalanced data. This idea is based on searching for possible combinations of different types of classifiers in the ensemble and selecting the best compositions using a genetic algorithm with the Matthews correlation coefficient as the fitness function.

In this work, we present the Genetic Ensemble Selection (ges) classifier, a novel ensemble learning method which trains diverse individuals on the basis of features selected in a supervised manner using a genetic algorithm. The base genetic approach is extended by an original regularisation component added to the learning criterion, aiming to ensure a high diversity of the ensemble and to protect it against the hazard of overfitting. The experiments conducted on a collection of benchmark datasets prove the validity and good performance of the proposed method. The solution refers to the method proposed by Ksieniewicz and Woźniak [11], who applied random search to find the best ensemble in the pool using the balanced accuracy score [3]. The following work expands research on this idea by implementing a genetic algorithm to search for optimal subspaces of features and modifies the proposed regularisation criterion towards alternative solutions.

In a nutshell, the main contributions of this work are as follows:

  • Proposition of a genetic-based classifier ensemble forming algorithm employing feature selection.

  • Proposition of an original fitness function consisting of a classification performance metric with the regularisation component.

  • Evaluation of the proposed method on the basis of a wide range of benchmark datasets.

2 Proposed Method

The aim of a classification algorithm is to assign an observed object to one of the labels from the predefined set \(\mathcal {M}=\{1,2,...,M\}\). The decision is made on the basis of an analysis of the selected attributes, which are gathered in the feature vector x

$$\begin{aligned} x=[x^{1},x^{2},\dots ,x^{d}]^{T}\in \mathcal {X}. \end{aligned}$$
(1)

Let us define a pool of individual classifiers as

$$\begin{aligned} \varPi =\{\varPsi _{1},\varPsi _{2},\dots ,\varPsi _{K}\}, \end{aligned}$$
(2)

where \(\varPsi _{k}\) denotes the k-th elementary classifier.

In this work the individuals use subsets of attributes to ensure a high diversity of the ensemble [14]; therefore, let us propose the following representation of \(\varPi \) as a bit word

$$\begin{aligned} \varPi =\left[ \left[ b_1^1,b_1^2,...,b_1^d\right] \left[ b_2^1,b_2^2,...,b_2^d\right] ...\left[ b_K^1,b_K^2,...,b_K^d\right] \right] , \end{aligned}$$
(3)

where \(b_i^j\) indicates whether the jth feature is used by the ith classifier.

[Algorithm 1. Procedure for computing the optimisation criterion \(Q(\varPi )\) (pseudocode not reproduced).]

As the optimisation criterion we propose

$$\begin{aligned} Q(\varPi )=\textit{metric}(\varPi )-\alpha \cdot \frac{\textit{no-features}(\varPi )}{d}+\beta \cdot \frac{\textit{av-Hamming}(\varPi )}{d}, \end{aligned}$$
(4)

where \(\textit{metric}(\varPi )\) denotes the value of the chosen performance metric, \(\textit{no-features}(\varPi )\) is the number of features used by all classifiers from \(\varPi \), and \(\textit{av-Hamming}(\varPi )\) stands for the average Hamming distance between the bit words representing the individuals in \(\varPi \). The procedure for computing the criterion is shown in Algorithm 1.
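To make the criterion concrete, the following is a minimal Python sketch of how \(Q(\varPi )\) from Eq. (4) could be computed. The encoding of \(\varPi \) as a binary mask matrix follows Eq. (3); the helper name criterion_Q, the metric_value argument and the interpretation of the feature-count term are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from itertools import combinations

def criterion_Q(masks, metric_value, alpha=0.0, beta=0.0):
    """Sketch of the optimisation criterion Q from Eq. (4).

    masks        -- binary matrix of shape (K, d); masks[i, j] == 1 when
                    the i-th classifier uses the j-th feature (Eq. 3)
    metric_value -- value of the chosen performance metric for the ensemble
    alpha, beta  -- regularisation weights
    """
    masks = np.asarray(masks, dtype=int)
    K, d = masks.shape

    # alpha term: features used by the ensemble, read here as the number of
    # distinct features used by at least one classifier (the total bit count
    # is another possible reading of Eq. 4)
    no_features = int(masks.any(axis=0).sum())

    # beta term: average pairwise Hamming distance between the bit words
    pairs = list(combinations(range(K), 2))
    av_hamming = (np.mean([np.sum(masks[i] != masks[j]) for i, j in pairs])
                  if pairs else 0.0)

    return metric_value - alpha * no_features / d + beta * av_hamming / d
```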

We employ a genetic algorithm to find the optimal ensemble line-up. The pseudocode of the ges (Genetic Ensemble Selection) procedure is presented in Algorithm 2. Let us briefly describe its main steps.

[Algorithm 2. The Genetic Ensemble Selection (ges) procedure (pseudocode not reproduced).]

Initialisation Procedure. Algorithm 2 starts with a randomly generated population of ensembles \(Population=\{\varPi _1, \varPi _2, ...., \varPi _S\}\). Its size is arbitrarily chosen by an input parameter. Basically, the larger the population, the more comprehensive the optimisation that is performed.
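As an illustration only, the initialisation could be realised as below; the helper name init_population and the feature-use probability p_use are assumptions of this sketch, not parameters defined in the paper.

```python
import numpy as np

def init_population(S, K, d, p_use=0.5, rng=None):
    """Randomly generate S ensembles, each encoded as a (K, d) binary mask."""
    rng = np.random.default_rng(rng)
    population = rng.random((S, K, d)) < p_use
    # make sure every classifier uses at least one feature
    for masks in population:
        for row in masks:
            if not row.any():
                row[rng.integers(d)] = True
    return population.astype(int)
```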

Evaluate Population Procedure. Individuals in the population are evaluated by criterion (4), computed according to Algorithm 1 on samples stored in the training set. The obtained results are used to select the elite (the best evaluated ensemble), which is immune to mutation or replacement by crossover.
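A possible realisation of this step is sketched below; the fitness callable, assumed to train the masked ensemble on the training set and return the value of criterion (4), stands in for details not specified here.

```python
import numpy as np

def evaluate_population(population, fitness):
    """Score every ensemble and return the scores together with the elite.

    fitness -- callable mapping a (K, d) mask matrix to its criterion value Q,
               e.g. by training the subspace ensemble and applying criterion_Q
    """
    scores = [fitness(masks) for masks in population]
    elite_idx = int(np.argmax(scores))
    return scores, population[elite_idx].copy()
```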

Genetic Operators. The mutation injects some randomness into the chromosomes. We use a simple bit mutation (flipping random bits in a binary word). The crossover operator exchanges random member classifiers between two randomly chosen parent individuals to form child chromosomes, using the standard two-point crossover operator [2].
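The two operators could look roughly as follows; this sketch assumes the bit-word encoding of Eq. (3) and treats the two crossover points as falling on classifier boundaries, which is one possible reading of the description above.

```python
import numpy as np

def mutate(masks, p_mut, rng):
    """Simple bit mutation: flip each bit independently with probability p_mut."""
    flips = rng.random(masks.shape) < p_mut
    return np.where(flips, 1 - masks, masks)

def crossover(parent_a, parent_b, rng):
    """Two-point crossover exchanging a slice of member classifiers."""
    K = parent_a.shape[0]
    lo, hi = np.sort(rng.integers(0, K + 1, size=2))
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[lo:hi], child_b[lo:hi] = parent_b[lo:hi], parent_a[lo:hi]
    return child_a, child_b
```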

Selection and Reproduction. To maintain the diversity of the population, which is essential for ensuring the ability to explore the space of possible solutions of the problem effectively, the selection process has to be realised in a probabilistic manner, i.e., not only are the best chromosomes chosen, but also the probability of selection is directly proportional to their evaluation according to criterion (4). To meet this assumption, a standard ranking selection procedure was implemented, in which \(S-1\) individuals are generated; they and the elite individual are joined together to form the offspring population.
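A sketch of such a ranking selection is given below, under the assumption that selection probabilities are proportional to the rank of the criterion value and that \(S-1\) parents are drawn with replacement; the exact ranking scheme used by the authors may differ.

```python
import numpy as np

def ranking_selection(population, scores, rng):
    """Select S-1 parents with probability proportional to their rank."""
    S = len(population)
    order = np.argsort(scores)              # indices from worst to best
    ranks = np.empty(S)
    ranks[order] = np.arange(1, S + 1)      # rank 1 = worst, rank S = best
    probs = ranks / ranks.sum()
    chosen = rng.choice(S, size=S - 1, p=probs, replace=True)
    return [population[i] for i in chosen]
```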

3 Experimental Evaluation

Conducting appropriate experiments and analysing the results should allow us to verify the idea and check the classification quality of the proposed method. This section describes the set-up of the conducted experiments and presents their results and analysis. During the experiments we would like to answer the following research questions:

  • Does the proposed approach improve the classification quality?

  • Do the adjusted regularisation methods have a positive effect on the operation of the classifier?

  • Is the random subspace an adequate solution for the feature selection for this classifier proposal?

3.1 Set-Up

All experiments were carried out using benchmark binary-class imbalanced data sets, which are publicly available in the keel repository [1]. The selected data sets have different imbalance ratios (from balanced problems to a 1:33 proportion) and various numbers of features. The tests were carried out on the basis of 5-fold stratified cross-validation, which allows for a reliable assessment of the proposed method. The Friedman ranking test [5] was used for assessing the ranks of classifiers over all examined benchmarks. It checks whether the assigned ranks differ significantly from assigning an average rank to each classifier. If the hypothesis of ranking equality is rejected, the Shaffer post hoc test [7] is used to find out which of the tested methods are distinctive in an n \(\times \) n comparison. The post hoc procedure was based on the significance level \(\alpha =0.05\).

The implementation was done according to the guidelines of the scikit-learn library API [13]. Logistic Regression and Gaussian Naive Bayes were used as the base classifiers, and the tests were carried out for a comparative analysis of the different approaches to building ges classifier ensembles on them:

  • ges—base Genetic Ensemble Selection classifier without regularisation,

  • ges-a—Genetic Ensemble Selection with alpha-regularisation,

  • ges-b—Genetic Ensemble Selection with both alpha- and beta-regularisation.

Each comparison was supplemented with a bare base method trained on the whole set of features and a Random Subspace (rs) approach. All 10 approaches were tested and evaluated using three metrics: F-score, geometric mean score and balanced accuracy score [3], where the metric used for evaluation is always also used as a part of the optimisation criterion Q. The entire implementation and the code allowing to repeat all the experiments have been placed in a publicly available repository (Footnote 1).
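For orientation, the following is a minimal sketch of the evaluation protocol described above, using scikit-learn and the geometric mean score from the imbalanced-learn package; the dataset loading is a placeholder and this code is not taken from the referenced repository.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, balanced_accuracy_score
from imblearn.metrics import geometric_mean_score

def evaluate(clf, X, y, n_splits=5, seed=42):
    """5-fold stratified cross-validation with the three metrics used here."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {"f1": [], "gmean": [], "bac": []}
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        scores["f1"].append(f1_score(y[test_idx], y_pred))
        scores["gmean"].append(geometric_mean_score(y[test_idx], y_pred))
        scores["bac"].append(balanced_accuracy_score(y[test_idx], y_pred))
    return {name: np.mean(vals) for name, vals in scores.items()}

# Example usage on a hypothetical KEEL dataset loaded into arrays X, y:
# print(evaluate(GaussianNB(), X, y))
```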

3.2 Results

The first experiment aimed to identify the best hyperparameters for the genetic algorithm implemented within ges. The tested parameters were the crossover probability and the mutation probability.

The conducted research allowed us to indicate the best parameter values, searched in the range from 0% to 10% for crossover and from 0% to 2% for mutation. Figure 1 shows the visualisation of the obtained learning curves for two exemplary datasets. The intensity of the green colour indicates the average value of the learning curve in the run. The height of the dot on the right and the result above each square give the quality of the already learned model on the test set. In addition, the intensity of the red colour of this dot indicates whether it is among the worst or the best results. After analysing all the results, the optimal parameters were selected as 2.5% for crossover and 1% for mutation. These hyperparameter settings were later used to carry out further experiments.

Fig. 1. Crossover and mutation probability influence on BAC.

The next experiment was to find the best regularisation parameters \(\alpha \) and \(\beta \). The parameters were tested in the range from 0 to 1 with a step of 0.125. Figure 2 shows the visualisations of the results obtained on exemplary datasets. The intensity of the yellow colour indicates the average value of the learning curve in the run. The height of the dot on the right and the result above each square give the quality of the already learned model on the test set. In addition, the intensity of the red colour of this dot indicates whether it is among the worst or the best results. The analysis of the obtained results allows us to find the best parameters.

Fig. 2. Example of alpha and beta influence on BAC.

The main experiment was aimed at conducting research on the data sets after optimising the parameters of the proposed method. Tables 1, 2 and 3 show the obtained results. Each table has been divided into two parts, one for each employed base classifier. The ranking test outcomes are presented in Table 4 for Logistic Regression as the base classifier and in Table 5 for Gaussian Naive Bayes.

Table 1. Overview of the results of the main experiment for F-score.
Table 2. Overview of the results of the main experiment for Gmean score.
Table 3. Overview of the results of the main experiment for Balanced accuracy score.

3.3 Analysis

The obtained scores show that the proposed ges method has a certain potential. Ranking tests indicate that ges and its variants have much better performance than the random subspace or base classifier methods. ges is able to choose the most favourable ensemble of subspaced classifiers, which gives it a significant advantage over rs. On the other hand, learning with this method is a computationally complex process and it cannot be applied to problems where the classifier's learning speed is an important factor. In response to the research questions of the experiments:

  • The results show that feature selection with the genetic algorithm approach can improve the imbalanced data classification quality.

  • Regularisation methods have no significant impact on the classifier's operation. This is indicated by the good performance of, and the lack of statistical difference between, base ges (no regularisation), ges-a (regularisation reduced to one component) and ges-b (full regularisation).

  • Classification based only on Random Subspace gives quite poor performance. Employing Random Subspace to generate candidate feature subsets together with the optimised selection allows a significant improvement compared to the classification on the full feature space.

Table 4. Shaffer test (\(\alpha =0.05\)) for comparison between the proposed methods with Logistic Regression as the base classifier (GMean). Shaffer's procedure rejects those hypotheses that have an unadjusted p-value \(\le 0.005\). Symbol '✗' stands for classifier pairs without significant differences, '✓' for situations in which the method on the left is superior.
Table 5. Shaffer test (\(\alpha =0.05\)) for comparison between the proposed methods with Gaussian Naive Bayes as the base classifier (GMean). Shaffer's procedure rejects those hypotheses that have an unadjusted p-value \(\le 0.005\). Symbol '✗' stands for classifier pairs without significant differences, '✓' for situations in which the method on the left is superior.

4 Conclusions and Future Directions

The presented work focuses on an innovative ensemble method (ges) dedicated to imbalanced data classification. It is based on a genetic algorithm aiming to optimise the selection of feature combinations. The conducted experimental evaluation shows the superiority of ges compared both to the basic random selection of features and to the base classifiers learned on the full feature space.

An important aspect of the analysis was the inclusion of the regularisation component in the fitness function, aiming to ensure the diversity of the classifier committee. Experiments have shown that introducing it does not give a statistically significant improvement in results, so in order to further develop this method, a different regularisation scheme should be proposed. The new regularisation should have a greater impact on the classifier's operation and lead to better classification performance. The proposed method can be considered a useful and effective tool in the classification of data with varying degrees of imbalance.