
1 Introduction

The purpose of classification is to learn a target function that links a set of data to a set of predefined categories (classes). The constructed target functions are also called classification models.

To address the instability of some classifiers [1, 2], it is widely accepted that using an ensemble of classifiers generally gives better results than using a single classification model [3,4,5,6]. Ensembles can be either homogeneous or heterogeneous depending on the nature of the models used. Homogeneous ensembles are obtained by performing different executions of the same learning algorithm, for example by varying the parameters of the learning algorithm or by manipulating the data (input or output) [2, 7,8,9]. Heterogeneous ensembles, in contrast, are obtained by running different learning algorithms on the same dataset. These models have different views of the data, which yields diverse ensembles.

Generating a large number of predictors allows a broad exploration of the solution space, and aggregating all the predictions yields a predictor that accounts for this exploration. The number of models can be increased without risk of overfitting [10]. However, a very large number of models requires a large storage space and a long prediction time, since all the predictors must be queried.

The simplification of ensembles of classifiers, called ensemble selection, reduces the size of ensembles prior to their integration or combination [11,12,13,14,15,16]. In this paper, we use a diversity-based entropy measure combined with a genetic algorithm-based search strategy to simplify an ensemble of heterogeneous classifiers.

The rest of the paper is organized as follows. In Sect. 2, we review recent work in the ensemble selection field, especially work related to heterogeneous ensembles. Section 3 details the proposed method, describing the entropy measure and the search strategy used. In Sect. 4, we give the experimental results and a comparative study against a heterogeneous ensemble selection method from the literature. Finally, in Sect. 5, we conclude and outline future work.

2 Literature Review

Ensemble methods have attracted growing interest since their appearance and have been applied in several areas, namely statistics, pattern recognition, and machine learning. These methods consist of two main phases: a model production phase and an aggregation or combination phase. In the first phase, models of variable performance are added arbitrarily; models with reduced performance negatively affect the ensemble performance. An ensemble may also contain several similar models, which reduces its diversity.

Several recent works on ensemble selection [3, 17,18,19] have been developed in the literature. Dai [20] proposes an improvement of the same authors' earlier ensemble selection method. This method uses depth-first backtracking, which is well suited to systematically searching for solutions to large combinatorial problems. The improvement concerns the response time of the method, which is considerably reduced in this study.

Bhatnagar et al. [21] perform ensemble selection using a function based on both performance and diversity, which considers the individual performance of classifiers as well as the diversity between pairs of classifiers. A bottom-up search is performed to generate the sub-ensembles by adding pairs of classifiers with high performance.

Simplifying a set of classifiers usually involves reducing the number of trees while maximizing performance. Based on rough set theory, Guo et al. [22] propose a new framework for ensemble selection. In this context, the relationships between attributes in an approximation space are considered a priori, along with their maximum degree of dependency. This effectively reduces the search space and increases the diversity of the selected sub-ensembles. Finally, to choose the appropriate sub-ensemble, an evaluation function that balances diversity and accuracy is used. The proposed method repeatedly changes the search space of the relevant sub-ensembles and selects the next ensemble from a new search space.

Selection methods have been applied to both homogeneous and heterogeneous ensembles [16, 23]. Heterogeneous ensembles are developed for sensitive application domains. Haque et al. [24] propose a genetic algorithm-based search to find the optimal combination of classifiers forming a heterogeneous ensemble. For this, they develop an algorithm that uses 10-fold cross-validation on the training set to evaluate the quality of each candidate ensemble. The proposed method uses a random resampling approach to balance the class distribution and is particularly suited to class-imbalance problems.

Pölsterl et al. [25] use a heterogeneous ensemble of survival models to predict the survival of patients with stage 3 prostate cancer. The results demonstrate that the constructed ensemble can predict the date of death of patients with this disease.

Partalas et al. [26] propose a diversity measure and a forward selection search strategy that considers all the possible cases that may arise when adding a model hk to an ensemble. The measure, called FES (Focused Ensemble Selection), considers four cases when adding hk to the current sub-ensemble Sub:

$$ \begin{aligned} FES_{Eval}\left( h_{k}, Sub \right) = \sum_{i = 1}^{\left| Eval \right|} \Big( & \alpha \cdot I\left( y_{i} = h_{k}(x_{i}) \wedge y_{i} \ne Sub(x_{i}) \right) - \beta \cdot I\left( y_{i} \ne h_{k}(x_{i}) \wedge y_{i} = Sub(x_{i}) \right) \\ & + \beta \cdot I\left( y_{i} = h_{k}(x_{i}) \wedge y_{i} = Sub(x_{i}) \right) - \alpha \cdot I\left( y_{i} \ne h_{k}(x_{i}) \wedge y_{i} \ne Sub(x_{i}) \right) \Big). \end{aligned} $$

Eval: the evaluation (selection) sample;

\( I\left( y_{i} = h_{k}(x_{i}) \wedge y_{i} \ne Sub(x_{i}) \right) = 1 \) if the instance xi is correctly classified by the model hk but not by the current sub-ensemble Sub, and 0 otherwise.

The factors α and β represent, respectively, the number of models in the sub-ensemble Sub that correctly classify the instance (xi, yi) and the number of models that misclassify it.
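As an illustration, the following Python sketch shows how the FES score defined above could be computed on an evaluation sample. The function name fes_score and its arguments are our own illustrative choices, not part of [26].

```python
import numpy as np

def fes_score(h_k_pred, sub_pred, y_eval, sub_preds_per_model):
    """Illustrative computation of the FES evaluation defined above.

    h_k_pred: predictions of the candidate model h_k on the evaluation sample.
    sub_pred: combined (e.g. majority-vote) predictions of the current sub-ensemble Sub.
    y_eval:   true labels of the evaluation sample.
    sub_preds_per_model: array of shape (n_models_in_Sub, n_eval) with the
        predictions of each member of Sub, used to derive alpha and beta per instance.
    """
    sub_preds_per_model = np.asarray(sub_preds_per_model)
    score = 0.0
    for i, y in enumerate(y_eval):
        # alpha: number of models of Sub that classify (x_i, y_i) correctly
        alpha = int(np.sum(sub_preds_per_model[:, i] == y))
        # beta: number of models of Sub that misclassify it
        beta = sub_preds_per_model.shape[0] - alpha

        hk_ok = (h_k_pred[i] == y)
        sub_ok = (sub_pred[i] == y)

        if hk_ok and not sub_ok:        # h_k corrects the ensemble
            score += alpha
        elif (not hk_ok) and sub_ok:    # h_k contradicts a correct ensemble
            score -= beta
        elif hk_ok and sub_ok:          # both correct
            score += beta
        else:                           # both wrong
            score -= alpha
    return score
```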

3 The EMnGA Method

3.1 Entropy Measure

The key idea of this approach is to retain only classifiers that are maximally diverse (i.e., weakly correlated with each other). This is based on the principle that the generalization error of the ensemble decreases as the diversity among its members increases.

Let ΩV be a sample of individuals with their labels (classes), \( \Omega_{V} = \left\{ v_{1}, \ldots, v_{n} \right\} \), with |ΩV| = n. Each individual vj is described by m variables denoted \( x_{1j}, \ldots, x_{mj} \). Let Ci be a classifier of the classifier ensemble \( \left\{ C_{1}, \ldots, C_{i}, \ldots, C_{T} \right\} \), represented by an n-dimensional binary vector \( y_{i} = (y_{1i}, \ldots, y_{ni})^{T} \) such that yji = 1 if the classifier Ci correctly classifies the individual vj and 0 otherwise. The entropy function fE measures the diversity within an ensemble [27]. Given an individual \( x_{j} \in \Omega_{V} \), if half of the T classifiers classify xj correctly, then the other half T − T/2 necessarily misclassify it, and vice versa; in this case the diversity is maximal.

We denote by nc(xj) the number of classifiers among the T that correctly classify xj, \( nc\left( x_{j} \right) = \sum\nolimits_{i = 1}^{T} y_{ji} \). The entropy measure fE is written as:

$$ f_{E} = \frac{1}{n}\sum\limits_{j = 1}^{n} \frac{1}{T - \frac{T}{2}}\min \left\{ nc\left( x_{j} \right), T - nc\left( x_{j} \right) \right\} $$

\( f_{E} \in \left[ 0, 1 \right] \), where 1 indicates very large diversity and 0 a lack of diversity. Thus, the goal is to maximize fE.
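A minimal Python sketch of the fE computation is given below, assuming the ensemble's correct/incorrect decisions are available as a binary matrix; the function name entropy_diversity is illustrative.

```python
import numpy as np

def entropy_diversity(correct):
    """Entropy-based diversity f_E as defined above.

    correct: binary array of shape (T, n) where correct[i, j] = 1 if
             classifier C_i correctly classifies individual x_j, 0 otherwise.
    Returns f_E in [0, 1]: 1 indicates maximal diversity, 0 no diversity.
    """
    correct = np.asarray(correct)
    T, n = correct.shape
    nc = correct.sum(axis=0)          # nc(x_j): number of correct votes per individual
    norm = T - T / 2.0                # normalizing factor from the formula
    return float(np.mean(np.minimum(nc, T - nc) / norm))
```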

3.2 Genetic Algorithms

Genetic algorithms are a natural choice for selection because they are inspired by natural selection. They generate individuals that optimize an evaluation function, also called the fitness function.

A genetic algorithm is defined by [28]:

  • Individual: also called a chromosome or sequence, it represents a potential solution of the problem. In our case, a solution corresponds to a binary string of size T (the number of models composing the ensemble). A chromosome is noted \( ch = (val_{1} val_{2} \ldots val_{T}) \) where vali = 1 if the i-th model is present in the selected chromosome and 0 otherwise;

  • Population: the set of chromosomes representing all possible combinations of 1 and 0 in a binary string of size T;

  • Environment: represents the search space, \( \left| ER \right| = 2^{T} \);

  • Fitness function: corresponds to the function ffE = fE (fE is the diversity function defined above). The goal is to maximize the value of ffE.

Computing the fitness function ffE for a chromosome ch1 amounts to computing the function fE over the classifiers selected by ch1:

$$ f_{E} = \frac{1}{n}\sum\limits_{j = 1}^{n} \frac{1}{T - \frac{T}{2}}\min \left\{ nc\left( x_{j} \right), T - nc\left( x_{j} \right) \right\} $$

For instance, let n = |ΩV| = 2 and T = 2 (the classifiers corresponding to the value 1, say T1 and T3); x1 and x2 are the individuals of ΩV, classified as (1 0)t and (0 1)t respectively by T1 and T3.

nc(x1) = 1 (the number of classifiers that correctly classify the instance x1).

nc(x2) = 1 (the number of classifiers that correctly classify the instance x2).

\( f_{E} = \frac{1}{2}\left( \frac{1}{2 - \frac{2}{2}} \cdot \min \left( 1, 2 - 1 \right) + \frac{1}{2 - \frac{2}{2}} \cdot \min \left( 1, 2 - 1 \right) \right) = 1 \) and ffE = fE = 1. The fitness ffE reaches its minimum 0 when the classifiers all agree (no diversity) and its maximum 1 when the diversity is maximal; hence \( ff_{E} \in \left[ 0, 1 \right] \).
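This toy computation can be reproduced with the entropy_diversity sketch from Sect. 3.1:

```python
# Toy example from the text: two classifiers, two individuals.
# T1 classifies (x1, x2) as (correct, wrong) -> row (1, 0);
# T3 classifies them as (wrong, correct)     -> row (0, 1).
correct = [[1, 0],
           [0, 1]]
print(entropy_diversity(correct))   # -> 1.0, maximal diversity
```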

We give hereafter the EMnGA algorithm, which uses a genetic algorithm-based search strategy and an entropy-based fitness function. It is described by the following pseudo code:

(Figure a: EMnGA pseudo code.)
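Since the pseudo code figure is not reproduced here, the sketch below illustrates one plausible reading of the EMnGA loop: a standard genetic algorithm over binary chromosomes of length T (tournament selection, one-point crossover, bit-flip mutation) that maximizes ffE = fE. The population size, number of generations, and crossover and mutation rates are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def emnga_select(correct, pop_size=50, generations=100, p_cross=0.9, p_mut=0.02, seed=None):
    """Illustrative GA-based ensemble selection maximizing f_E.

    correct: (T, n) binary matrix; correct[i, j] = 1 if classifier i is right on x_j.
    Returns a binary mask of length T (the best chromosome found).
    """
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    T = correct.shape[0]                      # assumes T >= 2

    def fitness(ch):                          # ffE = fE over the selected classifiers
        sel = correct[ch == 1]
        t = sel.shape[0]
        if t < 2:                             # degenerate chromosome: no diversity to measure
            return 0.0
        nc = sel.sum(axis=0)
        return float(np.mean(np.minimum(nc, t - nc) / (t - t / 2.0)))

    pop = rng.integers(0, 2, size=(pop_size, T))
    best, best_fit = None, -1.0
    for _ in range(generations):
        fits = np.array([fitness(ch) for ch in pop])
        if fits.max() > best_fit:             # keep the best chromosome seen so far
            best_fit, best = fits.max(), pop[fits.argmax()].copy()
        new_pop = []
        while len(new_pop) < pop_size:
            # binary tournament selection of two parents
            i, j = rng.integers(0, pop_size, size=2)
            p1 = pop[i] if fits[i] >= fits[j] else pop[j]
            i, j = rng.integers(0, pop_size, size=2)
            p2 = pop[i] if fits[i] >= fits[j] else pop[j]
            c1, c2 = p1.copy(), p2.copy()
            if rng.random() < p_cross:        # one-point crossover
                cut = rng.integers(1, T)
                c1[cut:], c2[cut:] = p2[cut:].copy(), p1[cut:].copy()
            for c in (c1, c2):                # bit-flip mutation
                flip = rng.random(T) < p_mut
                c[flip] = 1 - c[flip]
                new_pop.append(c)
        pop = np.array(new_pop[:pop_size])
    return best
```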

4 Experiments and Results

In this section, we describe the datasets used to carry out our experiments. We experimented on 8 benchmark datasets downloaded from the UCI Repository [29], as depicted in Table 1.

Table 1. Description of datasets

All datasets contain enough data to be split into three samples: a learning sample ΩL (40% of the initial sample size), a validation or selection sample (20% of the initial sample size), and the remaining 20% used for testing.

We adopt an approach similar to that proposed in [26]. For each dataset, we generate a heterogeneous ensemble of 200 models (classifiers) containing the following (a condensed code sketch is given after the list):

  • 60 k-NN (k-nearest neighbor) classifiers: 22 values of K ranging from 1 to the size of the training sample, with three weighting schemes: no weighting, inverse-distance weighting, and similarity weighting;

  • 110 SVMs: 10 values of the complexity parameter {10−7, 10−6, 10−5, 10−4, 10−3, 10−2, 0.1, 1, 10, 100} and 10 different kernels (2 polynomial kernels of degree 2 and 3, and 8 radial kernels with gamma ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2});

  • 2 naive Bayes classifiers: one built with the default parameters and one with a kernel density estimate;

  • 4 decision trees: two values are used for the confidence factor {0.25, 0.5};

  • 24 MLPs with a single hidden layer containing 6 neurons.
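The sketch below shows, in condensed form, how such a heterogeneous pool could be assembled with scikit-learn. The parameter grids are abbreviated, and some Weka-specific options (similarity weighting, kernel-estimate naive Bayes, the J48 confidence factor) have no direct scikit-learn equivalent, so the counts here are illustrative rather than the paper's exact 200 models.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def build_pool(k_values, C_values, gammas, n_trees=4, n_mlp=24):
    """Illustrative heterogeneous pool; not the paper's exact configuration."""
    pool = []
    # k-NN part: several values of k with two of the three weighting schemes
    # (similarity weighting would require a custom 'weights' callable).
    for k in k_values:
        for w in ("uniform", "distance"):
            pool.append(KNeighborsClassifier(n_neighbors=k, weights=w))
    # SVM part: polynomial kernels of degree 2 and 3, plus RBF kernels.
    for C in C_values:
        for d in (2, 3):
            pool.append(SVC(C=C, kernel="poly", degree=d))
        for g in gammas:
            pool.append(SVC(C=C, kernel="rbf", gamma=g))
    # Naive Bayes: Gaussian variant only (the kernel-estimate variant is a Weka option).
    pool.append(GaussianNB())
    # Decision trees: the confidence factor is a Weka J48 parameter; defaults are used here.
    pool.extend(DecisionTreeClassifier(random_state=s) for s in range(n_trees))
    # MLPs with a single hidden layer of 6 neurons.
    pool.extend(MLPClassifier(hidden_layer_sizes=(6,), random_state=s) for s in range(n_mlp))
    return pool

# Example (abbreviated grids):
# pool = build_pool(k_values=range(1, 23), C_values=[1e-3, 1, 100], gammas=[0.01, 0.5])
```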

For each dataset, the generation of the ensemble (composed of 200 models) is repeated 10 times, and majority voting is used to combine the different models.
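A simple majority-vote combiner, assuming each model outputs class labels, can be sketched as follows:

```python
import numpy as np

def majority_vote(predictions):
    """Combine model outputs by majority vote.

    predictions: array of shape (n_models, n_samples) holding each model's
    predicted class labels; returns one combined label per sample.
    """
    predictions = np.asarray(predictions)
    n_samples = predictions.shape[1]
    combined = np.empty(n_samples, dtype=predictions.dtype)
    for j in range(n_samples):
        labels, counts = np.unique(predictions[:, j], return_counts=True)
        combined[j] = labels[counts.argmax()]   # most frequent label wins
    return combined
```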

In Table 2, we give the performance results (success rates) obtained on the 8 datasets, averaged over 10 iterations for each dataset. Comparisons are made between the initial ensemble EI, the EMnGA method, and FES [26].

Table 2. Success rates obtained respectively by EI, EMnGA and FES for the 8 datasets

The EMnGA method outperforms FES on 5 of the 8 datasets, with improvements ranging from 0.5% to 10.92%. FES does better than the proposed method on 3 datasets, with improvements of 0.45% for kr-vs-kp and segment, and 1.27% for tic-tac-toe. On average, across all datasets, EMnGA improves on the performance of FES by 3% and on the initial ensemble EI by 12% (Table 3).

Table 3. Sizes of ensembles obtained by EMnGA and FES on the 8 datasets

We note that the FES method generates smaller ensembles in 6 of the 8 cases, with reductions ranging from 0.2 to 2.3. EMnGA produces smaller ensembles on the credit-g and tic-tac-toe datasets, by 0.1 and 1.8 respectively. On average, over the 8 datasets, FES yields a reduction of 0.5 compared to EMnGA.

5 Conclusion and Future Work

In this paper, we proposed an ensemble selection method using a diversity-based entropy measure combined with a genetic algorithm-based search strategy. The proposed method was compared with the FES method [26], which uses a forward selection search for heterogeneous ensembles. Based on the comparison against the criteria of performance and size of the generated ensemble, we observed that the EMnGA method generates larger ensembles than FES in most cases, mainly because its search strategy explores more possibilities, but it achieves better performance.

In future work, we will compare the measure with other methods from the literature using other search strategies, and we will use more datasets in order to formally validate our results with appropriate statistical tests. We will also experiment with a measure based on both diversity and performance in order to assess the impact of jointly using both criteria. Finally, we intend to apply the EMnGA method to random forest simplification.