
1 Introduction

Ensemble methods have been proposed to reduce the variance of individual classifiers. They consist of two phases: (1) a model generation step, in which models can be added without risk of overfitting; these models can be either homogeneous or heterogeneous, and in the homogeneous case they are obtained from the same learning algorithm by varying its parameters, using data resampling [6], or manipulating the input and/or output attributes [1, 7, 12, 17, 29]; (2) a combination step, in which the models are aggregated by voting or weighted voting.

However, these methods are known to generate a large number of models, which can entail large storage space and considerable classification and prediction time. Ensemble selection reduces the size of an ensemble of predictive models using measures based on the diversity and/or the accuracy of the models; this selection yields smaller ensembles with higher accuracy than the initial ones.

Several selection methods have been proposed in the literature; these methods are essentially based on an evaluation function [3, 9, 24] that determines whether a model M contributes positively to the performance of the whole ensemble. The evaluation relies on two important properties of an ensemble: diversity and accuracy. Diversity characterizes, for a set of classifiers, their ability to agree in large numbers on correct class predictions and to disagree on classification errors. The diversity and accuracy properties are considered either separately [9, 13, 24] or together [4, 11].

In this paper, we propose a novel evaluation function named Diversity and Accuracy for Pruning Ensembles (DAPE), based on both diversity and accuracy. The method is applied to homogeneous ensembles composed of C4.5 decision trees [26] and follows a DHCEP (Directed Hill Climbing Ensemble Pruning) strategy with a multi-objective function to evaluate the relevance of an ensemble of trees. The function, used in a hill climbing process with Forward Selection (FS), selects ensembles offering the best compromise between maximum diversity and minimum error rate.

A comparative study of the proposed measure and the UWA measure [25] is carried out on datasets from the UCI Repository [2].

The paper is structured as follows. Section 2 presents the bagging method used for homogeneous ensemble generation, combination by weighted and unweighted voting, and the hill climbing method used to search for a solution. Section 3 reviews recent work in the same domain. Section 4 introduces the proposed method in detail. The penultimate section contains a comparative study and experimental results on multiple datasets. Finally, we conclude and outline future work.

2 Background

In this section, we highlight the basic elements used in this paper, namely the bagging method used to generate the initial ensemble, aggregation by unweighted and weighted voting, and the hill climbing search strategy.

2.1 Diversification by Bagging

Bagging (Bootstrap Aggregating) is a resampling method introduced by Breiman in 1996 [6]. Given a learning sample $\Omega_L$ and a learning method that generates a predictor $\hat{h}(\cdot, \Omega_L)$ from $\Omega_L$, the principle of bagging is to draw several bootstrap samples $(\Omega_L^{\theta_1}, \ldots, \Omega_L^{\theta_q})$, generate a collection of predictors $(\hat{h}(\cdot, \Omega_L^{\theta_1}), \ldots, \hat{h}(\cdot, \Omega_L^{\theta_q}))$ using the base learning method on each sample, and finally aggregate them.

A bootstrap sample $\Omega_L^{\theta_l}$ is obtained by randomly drawing n observations from the starting sample $\Omega_L$, with $|\Omega_L| = n$. Each observation has a probability of 1/n of being drawn at each draw; the random variable $\theta_l$ represents the random draw.

Initially, bagging was introduced with a decision tree as base learner, but the scheme is general and can be applied to other base learners. Bagging transforms an unstable algorithm into a rule with very good properties (consistency and optimal speed of convergence) [5] (Fig. 1).

Fig. 1. Representative diagram of bagging.
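
To make the scheme concrete, here is a minimal Python sketch of the generation step. It uses scikit-learn's CART trees as a stand-in for C4.5 (an assumption; the experiments in this paper use Weka's C4.5), and all names are illustrative.

```python
# Minimal bagging sketch: q bootstrap samples, one tree per sample.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # CART stand-in for C4.5

def bagging_fit(X, y, q=10, seed=0):
    """Draw q bootstrap samples from (X, y) and fit one tree on each."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(q):
        # n draws with replacement: each observation has probability 1/n per draw.
        idx = rng.integers(0, n, size=n)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees
```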

2.2 Aggregation (Unweighted Vote, Weighted Vote)

Unweighted and weighted voting are the most widely used methods for combining (aggregating) models, whether homogeneous or heterogeneous. In an ensemble, each model outputs, for a given instance, a class value or a probability distribution over the classes; the ensemble then assigns to the instance the class with the most votes (or the highest average probability).

In weighted voting, each classification model is assigned a weight reflecting its classification accuracy. Formally [25]: let x be an instance and $m_i, i = 1 \ldots k$, a set of models that each output a probability distribution $m_i(x, c_j)$ for each class $c_j, j = 1 \ldots n$. The output y(x) of the (weighted) voting method for instance x is given by:

$$ y(x) = \arg\max_{c_j} \sum_{i=1}^{k} w_i \, m_i(x, c_j) $$
(1)

where $w_i$ is the weight of model i. In the simple (unweighted) voting case, the weights are all equal to one, that is, $w_i = 1, i = 1 \ldots k$.
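
A minimal sketch of Eq. (1) in Python, assuming each model exposes a scikit-learn-style predict_proba and that all models share the same class ordering (both assumptions):

```python
import numpy as np

def weighted_vote(models, weights, x):
    """Eq. (1): return the class index maximizing the weighted sum of the
    per-model probability distributions m_i(x, c_j)."""
    x = np.asarray(x).reshape(1, -1)
    # Accumulate w_i * m_i(x, c_j) over models; unweighted voting sets all w_i = 1.
    scores = sum(w_i * m_i.predict_proba(x)[0] for m_i, w_i in zip(models, weights))
    return int(np.argmax(scores))
```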

2.3 The Hill Climbing

Hill climbing is an optimization technique belonging to the family of local search methods. The algorithm starts from an arbitrary solution to a problem, then iteratively tries to find a better one by changing a single element of the solution. If the change produces a better solution (it increases or decreases the evaluation function guiding the search), the modified solution becomes the new current solution. The process is repeated until no further improvement can be found (the function has reached a maximum or minimum).

Hill climbing attempts to maximize (or minimize) a target function f(X), where X is a vector of continuous and/or discrete values. At each iteration, hill climbing adjusts a single element of X and checks whether the change improves the value of f(X). Any change that improves f(X) is accepted, and the process continues until no further improvement can be found.

For ensemble selection, DHCEP (Directed Hill Climbing Ensemble Pruning) is used; in this case the vector X is composed of classifiers or predictors. The search can proceed either by backward elimination or by forward selection: in the first case, the whole ensemble is taken as the initial solution and the elements that do not improve the evaluation function are eliminated one by one; in the second, the search is initialized with a randomly chosen element and the elements that improve the evaluation function are added one by one. The elements to be added or removed form the neighborhood of the current solution (Fig. 2). A sketch of the forward-selection variant is given after Fig. 2.

Fig. 2. Hill climbing search for selection in an ensemble composed of four classifiers [25].
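
The forward-selection variant can be sketched as follows (illustrative Python; as a simplification, the first step here picks the best single model rather than a random one):

```python
def forward_selection(ensemble, evaluate, maximize=True):
    """DHCEP in forward selection: grow the sub-ensemble one model at a time,
    keeping an addition only if it improves the evaluation function."""
    remaining, current, best = list(ensemble), [], None
    while remaining:
        # Neighborhood of the current solution: add any single remaining model.
        scored = [(evaluate(current + [m]), m) for m in remaining]
        pick = max if maximize else min
        score, model = pick(scored, key=lambda p: p[0])
        if best is not None and (score <= best if maximize else score >= best):
            break  # no neighbor improves the evaluation function: stop
        best = score
        current.append(model)
        remaining.remove(model)
    return current
```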

3 Related Work

Several methods have been proposed to reduce the size of an ensemble of classifiers [14, 16, 18, 19, 21, 23, 27]. These methods differ from one another in the search direction adopted, the evaluation measure used, or the evaluation set on which it is computed.

A first type of approach uses performance measures. Fan et al. [13] propose a profit-based evaluation function together with dynamic scheduling to accelerate the prediction process. To reduce the ensemble size, the total benefit is used as the selection criterion in conjunction with a greedy search algorithm, with and without backfitting; the search begins with the model with the greatest benefit. A set of instances x is considered, where each instance can be positive or negative; B(x) denotes the benefit of predicting x as positive, and the total benefit is $BT = \sum_{x} B(x)$. The authors choose the sub-ensembles that maximize the total benefit. Caruana et al. [9] use several performance metrics and a hill climbing strategy to build ensembles of models from libraries of thousands of models. The model libraries are generated using different learning algorithms, and forward stepwise selection is used to add to the ensemble the models that maximize its performance.

The second type of approach uses diversity-based measures [3, 20, 24]. Partalas et al. [25] propose a diversity measure that considers all the cases that may arise when adding a model $h_k$ to an ensemble. Their measure, Uncertainty Weighted Accuracy (UWA), distinguishes four cases when adding a model and uses weights to favor the beneficial cases over the others:

$$ \begin{aligned} UWA_{Eval}(h_k, Sub) = \sum_{i=1}^{|Eval|} \Big( & \alpha \cdot I\big(y_i = h_k(x_i) \wedge y_i \neq Sub(x_i)\big) - \beta \cdot I\big(y_i \neq h_k(x_i) \wedge y_i = Sub(x_i)\big) \\ + \; & \beta \cdot I\big(y_i = h_k(x_i) \wedge y_i = Sub(x_i)\big) - \alpha \cdot I\big(y_i \neq h_k(x_i) \wedge y_i \neq Sub(x_i)\big) \Big) \end{aligned} $$
(2)

The parameters α and β represent, respectively, the number of models in the sub-ensemble Sub that correctly classify the instance $(x_i, y_i)$ and the number of models that misclassify the same instance.
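
A hedged Python sketch of Eq. (2), following the definitions of α and β above (integer class labels, majority voting for Sub, and the helper name uwa are assumptions; the reference implementation may differ, e.g., in normalization):

```python
import numpy as np

def uwa(hk_preds, sub_preds, y):
    """Eq. (2): score of candidate h_k against the current sub-ensemble Sub.
    hk_preds: (n,) predictions of h_k on the evaluation set (integer labels).
    sub_preds: (|Sub|, n) predictions of the models already in Sub."""
    total = 0.0
    for i, y_i in enumerate(y):
        votes = sub_preds[:, i]
        alpha = int(np.sum(votes == y_i))    # models in Sub correct on (x_i, y_i)
        beta = len(votes) - alpha            # models in Sub wrong on (x_i, y_i)
        sub_i = np.bincount(votes).argmax()  # majority vote of Sub on x_i
        hk_ok, sub_ok = hk_preds[i] == y_i, sub_i == y_i
        if hk_ok and not sub_ok:
            total += alpha
        elif not hk_ok and sub_ok:
            total -= beta
        elif hk_ok and sub_ok:
            total += beta
        else:
            total -= alpha
    return total
```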

Li et al. [22] theoretically study the effect of diversity on the generalization performance of voting, using the Probably Approximately Correct (PAC) learning framework. They show that diversity is closely related to the complexity of the hypothesis space, and that it can be strengthened by applying regularization to ensemble methods. Based on this analysis, the authors apply explicit regularization of diversity for ensemble selection.

Zhou et al. [30] propose a new algorithm based on frequent itemset mining that maps the data and the pruned ensemble to a transactional database whose transactions are instances and whose items are classifiers. A Boolean classification matrix is used for each model of the pruned ensemble; using this matrix, several candidate ensembles are obtained by iteratively and incrementally extracting the base classifiers with the best performance.

Bhatnagar et al. [4] perform ensemble selection using a function based on both performance and diversity, which considers the individual performance of the classifiers as well as the pairwise diversity between them. A bottom-up search generates the sub-ensembles by adding diverse pairs of classifiers with high performance.

Cavalcanti et al. [10] combine different diversity matrices pairwise using a genetic algorithm. The combined diversity matrix is then used to group similar (not very diverse) models, which should not belong to the same ensemble. To generate candidate ensembles, the combined diversity matrix is transformed into one or more graphs, to which a graph coloring technique is applied.

Guo et al. [15] propose a new metric using the margin of instances and the diversity of classifiers to explicitly evaluate the importance of individual classifiers. By adding the models to the ensemble in decreasing order of the metric, the user can choose the first T models to form a sub-ensemble.

Qun et al. [11] emphasize the value of optimizing predictive performance together with diversity, two indispensable and inseparable criteria for ensemble selection. They propose three measures for pruning ensembles with a greedy algorithm: (1) the first simultaneously considers the difference (diversity) between the current subset and the candidate classifier and the performance of each; (2) the second evaluates the diversity within the ensemble; and (3) the last reinforces the emphasis on the accuracy of the resulting sub-ensemble. Experimental results confirm the interest of the three measures, as illustrated by the improvement in performance.

4 The Proposed Method

The dataset Ω is divided into two subsamples: $\Omega_L$ (generally 80% of Ω) for learning and pruning, and $\Omega_T$ (generally 20% of Ω) for testing. A bagging ensemble BE of t C4.5 trees, $BE = \{T_1, \ldots, T_i, \ldots, T_t\}$, is built on $\Omega_L$ with $|\Omega_L| = n$. Each tree $T_i$ is represented by a vector $(x_{1i}, x_{2i}, \ldots, x_{ji}, \ldots, x_{ni})^T$. We use the following notation:

  • $x_{ij}$: the result of the classification of individual i by tree j; $x_{ij} = 1$ if individual i is misclassified by tree $T_j$ and $x_{ij} = 0$ otherwise,

  • $x_{i+}$: the total number of errors committed on individual i,

    $$ x_{i+} = \sum_{j=1}^{t} x_{ij} $$
    (3)
  • $X$: the total number of errors committed by the ensemble,

    $$ X = \sum_{i=1}^{n} \sum_{j=1}^{t} x_{ij} $$
    (4)
  • $\theta_i$: the relative distribution of the error frequencies associated with the different individuals,

    $$ \theta_i = \frac{x_{i+}}{X}, \quad i = 1, \ldots, n $$
    (5)
  • $x_{+j}$: the number of errors committed by the classifier $T_j$ over all the individuals,

    $$ x_{+j} = \sum_{i=1}^{n} x_{ij} $$
    (6)
  • $e_j$: the error rate associated with the tree $T_j$,

    $$ e_j = \frac{x_{+j}}{n} $$
    (7)

The evaluation function to optimize, denoted S, combines the diversity terms $\theta_i$ and the error rates $e_j$ (α is a parameter whose value is chosen empirically). Since $\sum_i \theta_i^2$ is smallest when the errors are spread evenly over the individuals (high diversity) and $\sum_j e_j^2$ is smallest when the individual error rates are low, S is to be minimized:

$$ S = \sum_{i=1}^{n} \theta_i^2 + \alpha \sum_{j=1}^{t} e_j^2 $$
(8)
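
Given the Boolean error matrix x, Eqs. (3)–(8) translate directly into code; a minimal sketch (the function name dape_score and the default value of α are illustrative):

```python
import numpy as np

def dape_score(errors, alpha=1.0):
    """Eq. (8): S = sum_i theta_i^2 + alpha * sum_j e_j^2.
    errors: (n, t) 0/1 matrix with errors[i, j] = 1 iff tree T_j misclassifies
    individual i (the x_ij of Sect. 4)."""
    n, t = errors.shape
    x_i_plus = errors.sum(axis=1)                   # Eq. (3): errors per individual
    X = errors.sum()                                # Eq. (4): total errors
    theta = x_i_plus / X if X > 0 else np.zeros(n)  # Eq. (5)
    e = errors.sum(axis=0) / n                      # Eqs. (6) and (7): per-tree error rates
    return float(np.sum(theta ** 2) + alpha * np.sum(e ** 2))  # Eq. (8)
```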

The proposed DAPE method uses this function within the forward-selection hill climbing scheme of Sect. 2.3: trees from BE that improve S on the pruning set are added one by one, and the search stops when no candidate tree improves S.

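A hedged Python sketch of the procedure, reusing dape_score from above. It assumes S is minimized on the pruning set and, as a simplification, starts from the single tree with the lowest S rather than a random tree; all names are illustrative.

```python
import numpy as np

def dape_prune(trees, X_prune, y_prune, alpha=1.0):
    """Forward-selection hill climbing driven by S (Eq. 8): greedily add the
    tree that most decreases S, and stop when no candidate improves it."""
    preds = np.array([t.predict(X_prune) for t in trees])  # (t, n) predictions
    err = (preds != np.asarray(y_prune)).astype(int).T     # (n, t) error matrix
    selected, best_s = [], np.inf
    remaining = list(range(len(trees)))
    while remaining:
        # Neighborhood: the current sub-ensemble extended by one more tree.
        scored = [(dape_score(err[:, selected + [j]], alpha), j) for j in remaining]
        s, j = min(scored, key=lambda p: p[0])
        if s >= best_s:
            break  # the same function S serves as the stopping criterion
        best_s = s
        selected.append(j)
        remaining.remove(j)
    return [trees[j] for j in selected]
```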

5 Experiments

The experiments consist of building homogeneous ensembles by resampling the starting sample and using the C4.5 decision tree generation algorithm as base learner [26]. The Weka platform [28] is used as the source of the C4.5 learning algorithm and for validation. We consider 10 datasets from the UCI Repository [2], described in Table 1:

Table 1. Description of datasets.

The proposed method, DAPE, is compared to UWA [25], the diversity-based ensemble pruning method detailed above, whose source code is available at http://mlkd.csd.auth.gr/ensemblepruning.html. For both methods, the unweighted majority vote is used to combine the models and compute performance.

Both methods use a forward selection strategy in a hill climbing scheme. The stopping criterion for UWA is the performance on the evaluation sample, which yields subsets of reduced size compared with the usual stopping criterion of a fixed number of models [24]. In our case, the same function S is used both to guide the search and to stop it.

We use two criteria to compare the pruning methods: the performance and the size of the obtained subsets. ALL denotes the ensemble composed of all the models (Tables 2 and 3).

Table 2. Comparative study of DAPE and UWA based on accuracy of obtained sub-ensembles.
Table 3. Comparative study of DAPE and UWA based on the size of obtained sub-ensembles.

On 6 of the 10 benchmarks, DAPE shows better performance than UWA, which scores better on only 2; the whole ensemble gives the best performance in the remaining 2 cases.

In terms of average success rate over all datasets, DAPE ranks first with a rate of 0.885165, exceeding UWA by 0.5% and the whole ensemble by 1%.

Both DAPE and UWA obtain subsets of reduced size in 5 cases (50% of the cases). In terms of average size over the 10 datasets, DAPE ranks first with an average size of 11.31.

6 Conclusion and Future Work

In this paper we presented a new evaluation function combining performance and diversity for selection in a homogeneous ensemble, used within a hill climbing search process. The method was evaluated on several benchmarks and compared to UWA, a method based only on diversity; the results show the superiority of the proposed method.

As future work, we plan to apply the DAPE method to selection in heterogeneous ensembles, where the classifiers are not generated by the same learning algorithm, and to selection in random forest ensembles, knowing that a random forest improves on the performance of bagging [8].

We also propose to study the possibility of using another search strategy for the selection, in order to reduce search time, because the hill climbing strategy requires non-negligible time when the number of models becomes large.

The last point is to find a principled way of setting the parameter α, since we noticed during our experiments that an appropriate value can significantly improve the results.