1 Introduction

1.1 Ensemble of probabilistic classifiers

Classification is one of the main tasks of Supervised Learning. Given a dataset containing, for each object, the values of a set of attributes that describe it and a categorical output variable (the class to which the object belongs), with r ≥ 2 possible classes y1,…,yr, a classifier is an algorithm that infers the class of a new object or instance from its known attributes. Different Machine Learning methodologies are used to learn classifiers from a dataset of solved cases in which both the attributes and the class are recorded for each instance (up to missing data). Among them, we are interested in probabilistic classifiers, which not only predict the class but also estimate the probability distribution over the set of classes; the predicted class is the one with the highest associated probability, that is, the most likely class for the instance, following the maximum a posteriori (MAP) criterion.

Probabilistic classifiers provide a prediction that is useful in its own right, and particularly so when classifiers are combined to create ensembles. Ensembles of classifiers (also known as “combining classifiers”, see [18]) are a technique that has been widely applied in classification learning: the idea is to build a new classifier by combining a set of base classifiers, in the hope of improving on their behaviour, as different works effectively show (see [5, 11, 16, 18, 27]). An important research topic in this field is that of combination schemes: how to compare them and choose between them, as well as which type of base classifier to use to build the ensemble. For example, in [20] the authors show experimentally that ensembles of Naive Bayes classifiers following different schemes are significantly better than the standard Naive Bayes, and also slightly better than an ensemble of Naive Bayes and decision tree classifiers.

1.2 Bagging

As single decision classifiers can suffer from high variance, a simple way to address this flaw is to use them within randomization-based ensemble methods, introducing random perturbations into the learning procedure so as to produce several different classifiers from a single training set. This is the case of bagging, an acronym for Bootstrap AGGregatING, introduced by Breiman in [6]. Bagging is a meta-classifier that reduces the variance of a base classifier by injecting randomization into its construction: it fits base classifiers on learning sets obtained as bootstrap samples of the original training set (random subsets drawn with replacement) and then aggregates their individual predictions into an ensemble by means of the majority vote combiner scheme. With this procedure we lose the interpretability of a single simple classifier, but potentially gain predictive power.

In many cases, bagging is a simple way to improve the predictive power of a single model without changing the underlying classifier. The improvement is obtained for unstable underlying classifiers, as is the case of decision trees (see [6]), which are the base classifiers we will use in our experimental phase. Paraphrasing Breiman ([6]): “The evidence [...] is that bagging can push a good but unstable procedure a significant step towards optimality. On the other hand, it can slightly degrade the performance of stable procedures.”

Random forests, which are outstanding examples of probabilistic classifiers, were also introduced by Breiman in [7] as a variant of bagging in which the input attributes are randomized when considering candidates to split internal nodes (see also [26]). Ensembling classifiers is also the main idea behind boosting (see for instance [14] for a general description of boosting and [10] for an application in the field of medicine). A comparison of the effectiveness of randomization, bagging and boosting for improving the performance of the decision tree algorithm C4.5 can be found in [12].

1.3 Combiner schemes

In general, we build M ≥ 2 different base probabilistic classifiers, \({\mathcal C}_{1},\ldots , {\mathcal C}_{M}\), which may or may not come from the bagging meta-algorithm, and then combine their outputs to construct an ensemble. The final decision of the ensemble is derived using a combination rule, which falls into one of the following two groups (see for instance [30]):

  1. i)

    Hard voting: combination rules that operate on class labels, such as the simple majority vote, which goes with the prediction that appears most often among the base classifiers, and is the one used by bagging. In the binary setting, the majority vote criterion coincides with the classical Condorcet criterion, according to which the winning class must win its one-on-one matches against all other classes, that is, it must be preferred to each other class when compared to them one at a time.

  2. ii)

    Soft voting: combination rules obtained by polling the continuous outputs of the base classifiers through a function (average, maximum, minimum, product,... see [18, 19]) and returning the class label that maximizes the value of that function applied to the predicted probabilities; the average is the strongest of them from the viewpoint of predictive power. The average combiner is a natural competitor of the majority vote for bagging, and showed similar results in the experimental phase of [6] (Section 6.1). Our experimental evidence is different, however, as we will see in Section 6.3, where we find that the average scheme outperforms the majority vote, although at the price of a greater computational cost.

The combiner scheme that we introduce in this work lies halfway between the majority vote and the average (we will see later in what sense), so it can be considered a semi-hard voting scheme.

Another possible grouping of the combination rules is trainable vs. non-trainable. The simple majority vote and the average are non-trainable, but their usual weighted counterparts are trainable, since the weights are determined through a separate training algorithm, usually as a function of the estimated accuracies of the base classifiers. Using non-trainable rules provides simplicity and is less demanding in memory and computation.

The combiner scheme that we propose is non-trainable, although it is a weighted version of the simple majority vote, because its weights are obtained simply as a measure of the degree of support that each base classifier assigns to the class it predicts. As this measure we propose the confidence level (CL), that is, the probability that each base classifier assigns to its own prediction. It is also natural to consider an alternative measure, the modified confidence level (MCL), which takes into account additional knowledge of the probability distribution over the classes. The corresponding non-trainable weighted versions of the simple majority vote scheme are named in what follows CL-MV and MCL-MV, for Confidence Level (respectively, Modified Confidence Level) based Majority Vote.

From a heuristic perspective, our contribution consists of supporting the following hypotheses through experimentation with different datasets:

  • Main hypothesis: using CL-MV (or MCL-MV) instead of simple majority vote, the obtained scheme significantly improves the classifying power of bagging.

  • Secondary hypotheses:

    • Although there is no clear significant difference between them, in some cases the MCL-MV scheme gives better results than CL-MV (and in some others, the opposite).

    • Although bagging with the average rule outperforms that with the simple majority vote, neither CL-MV nor MCL-MV is outperformed by the average: they hold up, showing no statistically significant differences with it, while being computationally less demanding.

And from a theoretical perspective, our main contribution is to obtain two important results on the combiner schemes comparison:

  • In the binary case, we prove that if the number of base classifiers is odd, the accuracy of CL-MV is greater than that of the majority vote, provided the probability that each base classifier gives the correct label is large enough (Section 3).

  • In the general multi-class setting, we perform a sensitivity analysis (Section 4) showing that, under some reasonable hypotheses, CL-MV is more resilient to probability estimation errors than both the average and the product rules.

1.4 Measures of performance comparison

To compare the predictive power of the different combiner schemes, we will perform a series of experiments using real datasets following the bagging procedure. We denote by C = (Cij)i,j= 1,…,r a general confusion matrix, with Cij being the number of instances in the testing dataset that belong to class j and have been assigned to class i by the classifier, and we compare the goodness of the ensembles as classifiers on unseen data in the validation process using two different performance measures that can be applied both in the binary and in the multi-class setting:

  • Accuracy: this performance measure is one of the most intuitive and appealing, and is defined from C in this way:

$$\text{Accuracy}=\frac{{\sum}_{i=1}^{r} C_{ii}}{{\sum}_{i=1}^{r} {\sum}_{j=1}^{r} C_{ij}} .$$

Accuracy ranges between 0 and 1, the latter corresponding to perfect classification.

  • Matthews Correlation Coefficient (MCC): a more subtle performance measure, first introduced in the binary case by B.W. Matthews [21] as a measure of association obtained by discretization of Pearson’s correlation coefficient for two binary vectors, and generalized in [15] to multi-class classification. MCC has proven to be more reliable as a classification metric than Cohen’s Kappa, which has long been used in many works (see [9]). Its definition is as follows:

$$ \text{MCC}=\frac{\sum\limits_{k,\ell,m=1}^{r} (C_{kk} C_{\ell m}-C_{mk} C_{k\ell})} {\sqrt{\sum\limits_{k=1}^{r} \left( \left( \sum\limits_{\ell=1}^{r} C_{k\ell}\right)\left( \sum\limits_{u,v=1, u\ne k}^{r} C_{uv}\right)\right)} \sqrt{\sum\limits_{k=1}^{r} \left( \left( \sum\limits_{\ell=1}^{r} C_{\ell k}\right)\left( \sum\limits_{u,v=1, u\ne k}^{r} C_{vu}\right)\right)}} $$

MCC also assumes its theoretical maximum value of 1 when classification is perfect, but ranges between − 1 and 1.

In both cases, the larger the metric value, the better the classifier performance.
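As an illustration of how these two metrics are computed from a confusion matrix laid out as above (rows indexed by the predicted class, columns by the true class), the following R sketch may help; the function names are ours, and the MCC is computed with the standard multi-class form, which is algebraically equivalent to the displayed expression.

```r
# Accuracy and multi-class MCC from a confusion matrix C,
# with C[i, j] = number of instances of true class j predicted as class i.
accuracy <- function(C) sum(diag(C)) / sum(C)

mcc <- function(C) {
  s  <- sum(C)        # total number of instances
  c0 <- sum(diag(C))  # correctly classified instances
  pk <- rowSums(C)    # instances predicted as class k
  tk <- colSums(C)    # instances truly belonging to class k
  num <- c0 * s - sum(pk * tk)
  den <- sqrt(s^2 - sum(pk^2)) * sqrt(s^2 - sum(tk^2))
  if (den == 0) return(0)  # degenerate case (a single predicted or true class)
  num / den
}

# Toy 3-class confusion matrix
C <- matrix(c(50, 3, 2,
               4, 45, 6,
               1, 2, 37), nrow = 3, byrow = TRUE)
accuracy(C)
mcc(C)
```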

1.5 Description of the sections

The remainder of the paper is structured as follows. After introducing the CL-MV combiner scheme in Section 2, we compare the accuracies of CL-MV and the majority vote in the binary case in Section 3, and in Section 4 we investigate the sensitivity of the average, product and CL-MV schemes to probability estimation errors in the general multi-class setting. The Modified Confidence Level (MCL) is introduced in Section 5 as an alternative to the confidence level CL as degree of support in classification, and some of its properties are shown in the Appendix. Section 6 describes the datasets used, the experimental design aimed at comparing the predictive power and the computational complexity of the considered combiner schemes for bagging, and the results obtained (some complementary tables are collected in the Appendix). We finish with a discussion in Section 7 and some concluding words in Section 8.

2 The CL-MV combiner scheme

We build a novel ensemble classifier from M base classifiers \({\mathcal C}_{1}, \ldots , {\mathcal C}_{M}\) by introducing a modification of the simple majority vote scheme, which we name the Confidence Level based Majority Vote, CL-MV. This combiner scheme uses the classifications given by the base classifiers together with their corresponding estimated probability distributions, as follows:

If pjk denotes the probability that classifier j assigns to class k, then the class predicted by classifier j following the maximum a posteriori (MAP) criterion is the one with the largest assigned probability. That is, the predicted class of the j-th probabilistic classifier is

$$ y^{*}_{j}=y_{\ell} \quad\text{if}\quad \ell=\arg \max_{k=1,\ldots,r} p_{jk} ,\quad \text{and}\quad \text{CL}_{j}= \max_{k=1,\ldots,r} p_{jk} $$

is said to be the confidence level of the prediction, which is interpreted as a degree of support for it. (It is understood that if the maximum is attained at more than one class, one of them is chosen by a tie-breaking rule.) In general, a combiner scheme predicts in the following way:

$$ y^{*}_{ensemble}=y_{\ell}\quad\text{with}\quad \ell= \arg \max_{k=1,\ldots,r} g_{k} $$
(1)

for some function gk. Consider the following particular cases:

  1. 1.

    Majority vote: \(g_{k}={\sum }_{j=1}^{M} d_{jk}\), where \(d_{jk}=\begin {cases} 1 &\text { if } y_{j}^{*}=y_{k}\\ 0 & \text { otherwise.} \end {cases}\)

    Weighted majority vote: \(g_{k}={\sum }_{j=1}^{M} \omega _{j} d_{jk}\), where \(d_{jk}\) is as in the simple majority vote, and ωj is the weight assigned to classifier \({\mathcal C}_{j}\), usually obtained from its estimated accuracy (trainable combiner).

  2. 2.

    Average: \(g_{k}=\frac {1}{M} {\sum }_{j=1}^{M} p_{jk}\) (the Sum combiner is equivalent, with the same gk but without dividing by M).

  3. 3.

    Product: \(g_{k}={\prod }_{j=1}^{M} p_{jk}\)

  4. 4.

    Minimum: \(g_{k}=\min \limits _{j=1,\ldots ,M} p_{jk}\)

  5. 5.

    Maximum: \(g_{k}=\max \limits _{j=1,\ldots ,M} p_{jk}\)

We propose the following non-trainable weighted version of the majority vote combiner scheme, based on the confidence level CL as its degree of support:

  1. 6.

    CL-MV: \(g_{k}={\sum }_{j=1}^{M} \omega _{j} d_{jk}\), with djk as in the majority vote and ωj = CLj.

Note that ωj needs no separate training algorithm to be learned, which makes CL-MV a non-trainable combiner scheme, and that for each classifier it only uses the maximum of its probability distribution, unlike the average, which uses all the values of the probability distribution. Also note that, by definition of djk, gk only depends on the weight ωj for those j such that djk = 1, that is, such that \(y_{j}^{*}=y_{k}\), so we can assume without loss of generality that otherwise ωj = 0. Accordingly, we introduce the notation \(y(k)=\{j=1,\ldots ,M : y_{j}^{*}=y_{k}\}\) for any k = 1,…,r, with which we can rewrite:

  1. 1.

    Majority vote: gk = #y(k).

    Weighted majority vote: \(g_{k}={\sum }_{j\in y(k)} \omega _{j}\) where ωj is the weight assigned to classifier \({\mathcal C}_{j}\) by a separate learning algorithm.

  2. 6.

    CL-MV: \(g_{k}={\sum }_{j\in y(k)} \omega _{j}\) where ωj = CLj.

(Here and in the sequel, we use the convention that a sum over an empty set is zero.)

Remark 1

In the binary case (r = 2), we will see in Proposition 1 below that under certain circumstances CL-MV matches the average scheme.

Proposition 1

In binary classification (r = 2), suppose that \(y_{\text {CL-MV}}^{*}=y_{k}\). Then, if #y(k) ≤ M/2, we can ensure that \(y_{\text{Average}}^{*}=y_{k}\), that is, the CL-MV and the average schemes give the same prediction.

Before giving the proof, let us look at the two examples in Table 1, with M = 5 classifiers and r = 2 classes, where the probabilities pjk are listed: in example a) both the hypothesis and the conclusion of Proposition 1 hold, while in example b) neither does (it is a counterexample showing that without the hypothesis, the conclusion may fail).

Table 1 Illustrative examples of Proposition 1

Indeed, example a) in Table 1 shows that \(y_{Average}^{*}=y_{2}=y_{\text {CL-MV}}^{*}\) and #y(2) = 2 ≤ 5/2, while example b) gives \(y_{Average}^{*}=y_{2}\) and \(y_{\text {CL-MV}}^{*}=y_{1}\), which is possible since #y(1) = 3 > 5/2.

Proof of Proposition 1

Without loss of generality we can assume that \(y_{\text {CL-MV}}^{*}=y_{1}\). This means that

$$ {\sum}_{j \in y(1)} \text{CL}_{j} \ge {\sum}_{j\in y(2)} \text{CL}_{j} $$
(2)

Moreover, we assume that #y(1) ≤ M/2, which implies that #y(1) ≤ #y(2), since M = #y(1) + #y(2). Then, writing \({\sum }_{j \in y(1)} \left( 1-\text{CL}_{j} \right)=\#y(1)-{\sum }_{j \in y(1)} \text{CL}_{j}\) and using (2) together with #y(1) ≤ #y(2), we obtain that

$$ {\sum}_{j \in y(1)} \left( 1-\text{CL}_{j} \right) \le {\sum}_{j\in y(2)} \left( 1-\text{CL}_{j}\right) $$
(3)

and then, adding term to term (2) and (3), we obtain

$$ {\sum}_{j \in y(1)} \text{CL}_{j} + {\sum}_{j\in y(2)} \left( 1-\text{CL}_{j}\right) \ge {\sum}_{j\in y(2)} \text{CL}_{j} + {\sum}_{j \in y(1)} \left( 1-\text{CL}_{j} \right) , $$

that is, \(\sum \limits _{j=1}^{M} p_{j1} \ge \sum \limits _{j=1}^{M} p_{j2}\), which implies that \(y_{Average}^{*}=y_{1}\).

(If CL-MV classifies without ties, that is, \({\sum }_{j\in y(1)} \text {CL}_{j} > {\sum }_{j\in y(2)} \text {CL}_{j}\), then \(\sum \limits _{j=1}^{M} p_{j1} > \sum \limits _{j=1}^{M} p_{j2}\), so the average scheme also classifies without ties.)
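Proposition 1 can also be checked numerically. The following R sketch (our own helper code, not taken from the paper) draws random binary probability outputs for M = 5 base classifiers and verifies that, whenever the CL-MV winner is backed by at most M/2 of the classifiers, the average rule gives the same prediction.

```r
set.seed(1)
M <- 5
check_prop1 <- function() {
  p1 <- runif(M)                   # probability assigned to class 1 by each classifier
  P  <- cbind(p1, 1 - p1)          # M x 2 matrix of predicted probabilities
  pred <- apply(P, 1, which.max)   # individual MAP predictions
  CL   <- apply(P, 1, max)         # confidence levels
  g_cl  <- sapply(1:2, function(k) sum(CL[pred == k]))  # CL-MV scores
  k_cl  <- which.max(g_cl)         # CL-MV prediction
  k_avg <- which.max(colMeans(P))  # average-rule prediction
  # the implication of Proposition 1: #y(k_cl) <= M/2  =>  same prediction
  if (sum(pred == k_cl) <= M / 2) k_cl == k_avg else TRUE
}
all(replicate(10000, check_prop1()))  # expected: TRUE
```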

In words: with the simple majority vote, the combiner chooses the class with the most votes among the base classifiers (for each classifier, the counter of each class adds 1 if the classifier predicts that class, and 0 otherwise, and the combiner chooses the class with the highest counter); with the CL-MV combiner scheme, instead, the counter of each class adds the degree of support (CLj) if classifier \({\mathcal C}_{j}\) predicts that class, and 0 otherwise. Consider the toy example in Table 2 below, with M = 5 classifiers and r = 3 classes, which shows the probability distributions provided by each classifier. The corresponding predictions of the base classifiers and of the non-trainable ensembles are also given.

Table 2 In boldface: the CL in each probability distribution, as well as the maximum of the sum, the average, the product, the minimum and the maximum of the predicted probabilities, and the maximum of the sum of weights for both the majority vote and the CL-MV combiners

As can be seen in Table 2, the predictions with the ensembles are:

\(y_{\text {Sum}}^{*}=y_{\text {Average}}^{*}=y_{\text {Product}}^{*}=y_{\text {Minimum}}^{*}=y_{3},\)

\(y_{\text {Majority}}^{*}=y_{1}, y^{*}_{\text {CL-MV}} = y_{\text {Maximum}}^{*}=y_{2}\).

The main highlights of the CL-MV combiner scheme are:

  1. 1.

    It fuses not the predictions themselves but the degrees of support given to the predictions, where the degree of support is defined, as the confidence level, from the probability distribution over the classes assigned by each classifier.

  2. 2.

    It is non-trainable, that is, no extra parameters need to be trained and it is ready to run as soon as the base classifiers are available.

  3. 3.

    It can provide different predictions both from those provided by the majority vote and by the average criteria (see the toy example in Table 2).

  4. 4.

    In the binary case, we found a scenario where it gives the same prediction as the average (Proposition 1), and we will show that its accuracy is greater than that of the majority vote (see Section 3 below).

  5. 5.

    From the point of view of the sensitivity to probability estimation errors, under some reasonable hypotheses, CL-MV is more resilient than both the product and the average schemes (see Section 4 below).

The pseudo-code for the classification algorithm corresponding to the ensemble given by the novel CL-MV combiner scheme is Algorithm 1 below, and it has the benefit of being simple and easy to implement. For the sake of completeness, the pseudo-code for the standard majority vote and average schemes is also included as Algorithm 2 and Algorithm 3, respectively. The pseudo-code for the product, minimum and maximum combiner schemes is analogous to Algorithm 3, simply modifying the definition of gk in line 4 accordingly, and is therefore omitted.

[Algorithm 1: the CL-MV combiner scheme]
[Algorithm 2: the majority vote combiner scheme]
[Algorithm 3: the average combiner scheme]
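As the pseudo-code figures are not reproduced here, the following R sketch illustrates the three combiners within the common framework (1). It is an illustrative implementation written for this text (function and argument names are ours), not necessarily identical line by line to Algorithms 1-3.

```r
# P: M x r matrix, with P[j, k] the probability that base classifier j assigns to class k.
ensemble_predict <- function(P, scheme = c("cl_mv", "majority", "average")) {
  scheme <- match.arg(scheme)
  r <- ncol(P)
  pred <- apply(P, 1, which.max)   # MAP prediction of each base classifier
  CL   <- apply(P, 1, max)         # confidence level of each base classifier
  g <- switch(scheme,
    majority = sapply(1:r, function(k) sum(pred == k)),      # g_k = #y(k)
    cl_mv    = sapply(1:r, function(k) sum(CL[pred == k])),  # g_k = sum of CL_j over y(k)
    average  = colMeans(P)                                   # g_k = (1/M) sum_j p_jk
  )
  which.max(g)  # index of the predicted class (ties broken by the first maximum)
}

# Example with M = 5 classifiers and r = 3 classes
set.seed(123)
P <- matrix(runif(15), nrow = 5)
P <- P / rowSums(P)               # normalize each row into a probability distribution
ensemble_predict(P, "cl_mv")
ensemble_predict(P, "majority")
ensemble_predict(P, "average")
```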

Remark 2

As we verify empirically in Section 6.3, from the point of view of computation and storage, the majority vote (hard voting) has the advantage that once we know the prediction of each base classifier, \(y_{j}^{*}\), we do not need to store any other information about the probability distributions of the predictions. At the other extreme, the average scheme (soft voting) needs to store and use all the values of the distributions, making it more demanding in computation and storage. The CL-MV combiner is halfway between them, using only the maximum of each distribution. That is why we say that CL-MV is a semi-hard voting combiner scheme.

3 Accuracy in the binary case

The majority vote is one of the most popular combiner schemes, if not the most popular. In Section 4.2 of [19], to find out why this is so, the author studies its accuracy in depth, assuming that

  1. i)

    the number of classes is r = 2 (binary case),

  2. ii)

    the number of classifiers M is odd, say M = 2L + 1 with L ≥ 1,

  3. iii)

    the base classifiers outputs are independent (this condition may seem unrealistic, but for many applications it holds, at least approximately),

  4. iv)

    the probability that each base classifier gives the correct class label is p ∈ (0,1).

The majority vote will predict the correct class label if a simple majority of the base classifiers vote for it, that is, if at least ⌊M/2⌋ + 1 of them give the correct prediction (where ⌊x⌋ denotes the integer part or floor of x). Therefore, the accuracy of the ensemble based on the majority vote combiner is:

$$ Acc_{majority}={\sum}_{\ell = \lfloor{M/2}\rfloor+1}^{M} \binom{M}{\ell} p^{\ell} (1-p)^{M-\ell} $$
(4)

Then, it can be seen (Condorcet Jury Theorem, 1785 [24]) that

  1. a)

    if \(p>0.5, \lim _{M\to \infty } Acc_{majority}=1\) and it is monotonically increasing,

  2. b)

    if p = 0.5, Accmajority = 0.5 for all M,

  3. c)

    if \(p<0.5, \lim _{M\to \infty } Acc_{majority}=0\) and it is monotonically decreasing.

This result supports the intuition that we can expect an improvement over the individual accuracy p only if p > 0.5.

In the same scenario, what is the formula for the accuracy of the ensemble based on CL-MV? This combiner scheme gives the correct prediction if the number of classifiers that predict correctly is at least α, with α being the minimum integer such that

$$p \alpha > (1-p) (M-\alpha),$$

which is equivalent to saying that α is the minimum integer such that α > (1 − p)M. Then α = ⌊(1 − p)M⌋ + 1, and consequently the accuracy is:

$$ Acc_{\text{CL-MV}}=\sum\limits_{\ell = \lfloor{(1-p) M}\rfloor+1}^{M} \binom{M}{\ell} p^{\ell} (1-p)^{M-\ell} $$
(5)

We can easily prove the next result:

Proposition 2

In the binary case and with an odd number of base classifiers M, we have that

$$ \begin{cases} \text{If } p>\frac{M+1}{2 M}, &\text{ then } Acc_{\text{CL-MV}}>Acc_{majority},\\ \text{If } \frac{M-1}{2 M} \le p \le \frac{M+1}{2 M}, &\text{ then } Acc_{\text{CL-MV}}=Acc_{majority},\\ \text{If } p<\frac{M-1}{2 M}, &\text{ then } Acc_{\text{CL-MV}}<Acc_{majority}. \end{cases} $$

In particular, since (M + 1)/(2M) > 1/2, Proposition 2 implies that if p > (M + 1)/(2M), CL-MV strictly improves accuracy with respect to the majority vote combiner scheme, and by the Condorcet Jury Theorem, AccCL-MV is monotonically increasing and

$$\lim_{M\to \infty} Acc_{\text{CL-MV}} = 1 .$$

Proof

By (4) and (5) we see that AccCL-MV > Accmajority if and only if

$$\lfloor{M/2}\rfloor\ge \lfloor{(1-p) M}\rfloor+1\Leftrightarrow \lfloor{(1-p) M}\rfloor \le L-1 \Leftrightarrow (1-p) M < L \Leftrightarrow p>\frac{M+1}{2 M}.$$

On the other hand, AccCL-MV < Accmajority if and only if

$$\lfloor{M/2}\rfloor+1\le \lfloor{(1-p) M}\rfloor\Leftrightarrow (1-p) M \ge L+1 \Leftrightarrow p<\frac{M-1}{2 M}.$$

Otherwise, the accuracies are equal.
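Both accuracies (4) and (5) are upper tails of a Binomial(M, p) distribution, so they can be evaluated directly with pbinom; the short sketch below (our own, with illustrative values) checks Proposition 2 numerically for M = 11 and a value of p above the threshold (M + 1)/(2M) = 6/11.

```r
acc_majority <- function(M, p) 1 - pbinom(floor(M / 2), M, p)        # formula (4)
acc_clmv     <- function(M, p) 1 - pbinom(floor((1 - p) * M), M, p)  # formula (5)

M <- 11
p <- 0.60                 # p > (M + 1) / (2 * M) = 6/11
acc_majority(M, p)        # approx. 0.753
acc_clmv(M, p)            # approx. 0.901, strictly larger as stated in Proposition 2
```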

4 Error sensitivity

In this section we investigate the sensitivity of the product, the average (equivalently, the sum) and the CL-MV combiner schemes to probability estimation errors, by following the approach of [18].

Suppose now that the probabilities pjk, for j = 1,…,M and k = 1,…,r, are not computed exactly, but suffer from an estimation error. Denote by \(\hat p_{jk}\) the obtained estimates of the probabilities, which are the values used by the combination rules to obtain their predictions. As the additive error model is the most popular error model in statistics, we assume that the estimation error is additive, that is, for any j and k,

$$\widehat p_{jk} = p_{jk} + \varepsilon_{jk} ,$$

where the errors εjk are small (in absolute value). In particular, we assume that the errors do not affect the individual prediction of any base classifier (\(y_{j}^{*}\) is not affected by errors). In other words, we assume that for any j = 1,…,M,

$$ \arg \max_{k=1,\ldots,r} p_{jk} = \arg \max_{k=1,\ldots,r} \widehat p_{jk} . $$
(6)

This implies that y(k) is also unaffected by the probability estimation errors.

We are concerned about the effect that those probability estimation errors will have on the predictions obtained by the ensembles following the different combination rules. By (1), now \(y^{*}_{ensemble}=y_{\ell }\) with \(\ell = \arg \max \limits _{k=1,\ldots ,r} \widehat g_{k}\), where \(\widehat g_{k}\) denotes the estimate of function gk obtained substituting probabilities pjk by their estimates \(\widehat p_{jk}\). In what follows, we concentrate on the product, the average and the CL-MV combiner schemes.

  1. a.

    The product scheme: \(g_{k}=\prod \limits _{j=1}^{M} p_{jk}\).

    Following [18], we can write

    $$ \begin{array}{@{}rcl@{}} \widehat g_{k} &=& {\prod}_{j=1}^{M} \widehat p_{jk} = {\prod}_{j=1}^{M} (p_{jk}+\varepsilon_{jk})=\left( {\prod}_{j=1}^{M} p_{jk}\right) {\prod}_{i=1}^{M} \left( 1+\frac{\varepsilon_{ik}}{p_{ik}}\right)\\ & \approx& \left( {\prod}_{j=1}^{M} p_{jk}\right) \left( 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}}\right) = g_{k} \left( 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}}\right), \end{array} $$

    where we have made a linear approximation (we neglect higher order terms since the errors εjk are small and therefore, the product of two or more of them is of a very small order). That is, if the probabilities pjk are affected by the additive estimation errors εjk, then gk is affected by a multiplicative error \({{\varPsi }}_{k}^{prod}\), where

    $$ {{\varPsi}}_{k}^{prod} = 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}}. $$
  2. b.

    The average scheme: \(g_{k}=\frac {1}{M} \sum \limits _{j=1}^{M} p_{jk}\).

    Following [18] again,

    $$ \begin{array}{@{}rcl@{}} \widehat g_{k} &=& \frac{1}{M}{\sum}_{j=1}^{M} \widehat p_{jk} = \frac{1}{M}{\sum}_{j=1}^{M} (p_{jk}+\varepsilon_{jk})=\left( \frac{1}{M} {\sum}_{j=1}^{M} p_{jk}\right) \left( 1+\frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}\right)\\ &=& g_{k} \left( 1+\frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}\right). \end{array} $$

    In this case, if the probabilities pjk are affected by the additive estimation errors εjk, then gk is affected by a multiplicative error \({{\varPsi }}_{k}^{aver}\), where

    $$ {{\varPsi}}_{k}^{aver} = 1+\frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}. $$
  3. c.

    The CL-MV scheme: \(g_{k}=\sum \limits _{j\in y(k)} \text {CL}_{j}\) (assume that k is such that y(k) ≠ ∅), where \(\text{CL}_{j}=\max \limits _{\ell =1,\ldots ,r} p_{j\ell }\).

    Therefore,

    $$ \widehat g_{k} =\sum\limits_{j\in y(k)} \widehat{\text{CL}}_{j}=\sum\limits_{j\in y(k)} \max\limits_{\ell=1,\ldots,r} \widehat p_{j\ell}=\sum\limits_{j\in y(k)} \max\limits_{\ell=1,\ldots,r} \left( p_{j\ell} + \varepsilon_{j\ell}\right). $$
    (7)

    We make the following assumption:

    $$\text{(H$_{1}$)}\qquad \text{for all } k=1,\ldots,r,\quad k=\arg \max_{\ell=1,\ldots,r} \varepsilon_{j\ell}\quad \text{for all } j\in y(k),$$

    that is, \(\varepsilon _{jk}=\max \limits _{\ell =1,\ldots ,r} \varepsilon _{j\ell }\). In words, for each classifier the estimation error is largest at the class it predicts. Under (H1) we have that for any j ∈ y(k),

    $$ \max\limits_{\ell=1,\ldots,r} \left( p_{j\ell} + \varepsilon_{j\ell}\right)=\text{CL}_{j}+\varepsilon_{jk}.$$

    Then, substituting into (7), we have that under (H1),

    $$ \begin{array}{@{}rcl@{}} \widehat g_{k} &=&\sum\limits_{j\in y(k)} \left( \text{CL}_{j}+\varepsilon_{jk}\right)= \left( \sum\limits_{j\in y(k)} \text{CL}_{j}\right) \left( 1+\frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} \text{CL}_{i}}\right) \\ &= &g_{k} \left( 1+\frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} \text{CL}_{i}}\right). \end{array} $$

    Then, if the probabilities pjk are affected by the additive estimation errors εjk, function gk is affected by a multiplicative error \({{\varPsi }}_{k}^{\text {CL-MV}}\), with

    $$ {{\varPsi}}_{k}^{\text{CL-MV}} = 1+\frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} \text{CL}_{i}} = 1+\frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} p_{ik}}. $$
    (8)
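With the three multiplicative error factors at hand, they can be computed directly from the true probabilities and the errors; the following R sketch (our own notation) does so for a fixed class k, and the small example satisfies the hypotheses of Proposition 3 below, so the factors appear in the expected order.

```r
# P:   M x r matrix of true probabilities p_jk
# Eps: M x r matrix of estimation errors, so that Phat = P + Eps
error_factors <- function(P, Eps, k) {
  pred <- apply(P, 1, which.max)  # individual predictions, unaffected by errors by (6)
  yk   <- which(pred == k)        # the set y(k)
  c(prod    = 1 + sum(Eps[, k] / P[, k]),
    average = 1 + sum(Eps[, k]) / sum(P[, k]),
    cl_mv   = if (length(yk) > 0) 1 + sum(Eps[yk, k]) / sum(P[yk, k]) else NA)
}

# Small illustration with M = 3 classifiers and r = 2 classes
P   <- rbind(c(0.7, 0.3), c(0.6, 0.4), c(0.4, 0.6))
Eps <- rbind(c(0.01, -0.01), c(0.01, -0.01), c(0.05, -0.05))
error_factors(P, Eps, k = 1)  # cl_mv <= average <= prod, as in Proposition 3
```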

Now we can compare the error factors to see which combiner scheme is more resilient to probability estimation errors, and prove the following result (Proposition 3 below). First, we introduce a hypothesis:

$$ \text{(H$_{2}$)}\qquad \text{for all } k=1,\ldots,r,\quad {\sum}_{j\in y(k)} \varepsilon_{jk} {\sum}_{j\notin y(k)} \widehat p_{jk} \le {\sum}_{j\notin y(k)} \varepsilon_{jk} {\sum}_{j\in y(k)} \widehat p_{jk}. $$

The rationale for this assumption is that, for fixed k, class k is the most probable class for the classifiers \({\mathcal C}_{j}\) with j ∈ y(k), and then \({\sum }_{j\in y(k)} \widehat p_{jk}\) is likely to be sufficiently larger than \({\sum }_{j\notin y(k)} \widehat p_{jk}\) to compensate for the fact that \({\sum }_{j\notin y(k)} \varepsilon _{jk}\) could be smaller than \({\sum }_{j\in y(k)} \varepsilon _{jk}\), since the errors are assumed to be small. This is precisely the situation that we will see, as an illustration, for the example in Table 2 (see Table 3 below). Proposition 3 states that the average rule is much less affected by probability estimation errors than the product rule and that, under reasonable conditions, the CL-MV rule is even less affected by errors than the average.

Table 3 Predicted probabilities \(\widehat p_{jk}\) and estimation errors εjk in brackets, for the example in Table 2, where ε > 0 is small, and δ ∈ (0,1)

Proposition 3

With the previous notations, for any k = 1,…,r,

  1. a)

    If \({\sum }_{i=1}^{M} \widehat p_{ik}\ge 1+{\sum }_{i=1}^{M} \varepsilon _{ik}\), then \( {{\varPsi }}_{k}^{aver} \le {{\varPsi }}_{k}^{prod}\).

  2. b)

    If \({\sum }_{i\in y(k)} \widehat p_{ik}\ge 1 + {\sum }_{i \in y(k)} \varepsilon _{ik}\), then \( {{\varPsi }}_{k}^{\text {CL-MV}} \le {{\varPsi }}_{k}^{prod}\).

  3. c)

    Under (H2), \({{\varPsi }}_{k}^{\text {CL-MV}} \le {{\varPsi }}_{k}^{aver}\).

Proof

The proof of statement a) can be found in Section 6 of [18], but we reproduce it here for the sake of completeness. Indeed, as pik ≤ 1, each error εik is amplified in \({{\varPsi }}_{k}^{prod}\) by dividing it by pik, while for the average the errors are not amplified, in such a way that

$${{\varPsi}}_{k}^{prod} = 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}} \ge 1+{\sum}_{i=1}^{M} \varepsilon_{ik} \!\ge\! 1+\frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}={{\varPsi}}_{k}^{aver},$$

where the second inequality is due to the fact that we are assuming that \({\sum }_{i=1}^{M} \widehat p_{ik}\ge 1+{\sum }_{i=1}^{M} \varepsilon _{ik}\), which is equivalent to \({\sum }_{i=1}^{M} p_{ik}\ge 1\). This assumption is likely to happen for the most probable class(es).

The proof of b) is similar for k such that \({\sum }_{i\in y(k)} \widehat p_{ik}\ge 1 + {\sum }_{i \in y(k)} \varepsilon _{ik}\), which is equivalent to \({\sum }_{i\in y(k)} p_{ik}\ge 1\):

$$ {{\varPsi}}_{k}^{prod} = 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}} \ge 1+{\sum}_{i\in y(k)} \varepsilon_{ik} \ge 1+\frac{{\sum}_{i\in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} p_{ik}}={{\varPsi}}_{k}^{\text{CL-MV}}. $$

To see that statement c) holds, we only have to prove that for any k = 1,…,r verifying (H2),

$$ \frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} p_{ik}} \le \frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}. $$
(9)

Indeed, this holds since

$$ \begin{array}{@{}rcl@{}} (9) &\Leftrightarrow& {\sum}_{i \in y(k)} \varepsilon_{ik} \left( {\sum}_{i \in y(k)} p_{ik} + {\sum}_{i \notin y(k)} p_{ik}\right)\\ &&\le {\sum}_{i \in y(k)} p_{ik} \left( {\sum}_{i \in y(k)} \varepsilon_{ik} + {\sum}_{i \notin y(k)} \varepsilon_{ik}\right)\\ &\Leftrightarrow& {\sum}_{i \in y(k)} \varepsilon_{ik} {\sum}_{i \notin y(k)} p_{ik} \le {\sum}_{i \in y(k)} p_{ik} {\sum}_{i \notin y(k)} \varepsilon_{ik}\\ & \!\Leftrightarrow\!& \!{\sum}_{i \in y(k)} \varepsilon_{ik} {\sum}_{i \notin y(k)} \left( \widehat p_{ik} - \varepsilon_{ik}\right)\!\le\! {\sum}_{i \in y(k)} \left( \widehat p_{ik} - \varepsilon_{ik}\right)\! {\sum}_{i \notin y(k)} \varepsilon_{ik}\\ &\Leftrightarrow& {\sum}_{i \in y(k)} \varepsilon_{ik} {\sum}_{i \notin y(k)} \widehat p_{ik} \le {\sum}_{i \in y(k)} \widehat p_{ik} {\sum}_{i \notin y(k)} \varepsilon_{ik}, \end{array} $$

which is exactly (H2).

To illustrate the hypotheses made to establish Proposition 3, we return to the toy example of Table 2, now displayed with estimation errors in Table 3 below.

Observe that in Table 3 some of the estimation errors are negative. Indeed, since the true probabilities satisfy \({\sum }_{k=1}^{r} p_{jk}=1\) for any j = 1,…,M, and we also assume that the same holds for the predicted probabilities, \({\sum }_{k=1}^{r} \widehat p_{jk}=1\), necessarily the sum of the estimation errors of any classifier must be zero, that is, \({\sum }_{k=1}^{r} \varepsilon _{jk}=0\). In other words, the sum of the estimation errors along each row of Table 3 must be 0. Although we will not give the details, it can be checked that for δ < 1 large enough and ε > 0 small enough, for instance,

$$\delta > 0.20 \quad \text{and}\quad \varepsilon< 0.08 ,$$

all assumptions are met:

  1. 0)

    The basic assumption that \(p_{jk}=\widehat p_{jk} -\varepsilon _{jk}\in [0,1]\).

  2. i)

    Assumption (6).

  3. ii)

    \({\sum }_{i=1}^{M} \widehat p_{ik}\ge 1+{\sum }_{i=1}^{M} \varepsilon _{ik}\) for any k = 1,2,3.

  4. iii)

    \({\sum }_{i\in y(k)} \widehat p_{ik}\ge 1 + {\sum }_{i \in y(k)} \varepsilon _{ik}\) for k = 1,2 (y(3) = ∅, so the condition does not make sense for k = 3).

  5. iv)

    Hypotheses (H1) and (H2).

5 The modified confidence level (MCL)

Although CL quantifies the uncertainty associated with the class prediction and is the usual measure of degree of support, it suffers from a shortcoming. The information provided by CL is very valuable, but it can be insufficient to compare predictions made by different classifiers in the multi-class setting. A toy example suffices as an argument. Imagine that we are in the 3-class setting and that, for two classifiers, the probability distributions associated with their predictions are, respectively:

$$(0.6, 0.4, 0.0)\quad\text{and}\quad (0.2, 0.6, 0.2) .$$

For the first classifier the predicted class will be y1, with a confidence level of CL = 0.6, while the second classifier will predict y2, with the same confidence level. Can we choose objectively between them? The prediction of the second classifier seems more “reliable” (in the intuitive sense of being more dependable). Then, if we had to choose, intuitively we would choose y2 as class prediction, that is, we would prefer the classification provided by the second classifier.

To formalize this intuition, since CL has proven unable to distinguish between the two predictions in the example, we introduce a modification that can, which we name the Modified Confidence Level (MCL). MCL can then be used as an alternative to CL to compare and choose among predictions made by different classifiers, and therefore to construct a combiner scheme that we name MCL-MV, which is the already introduced CL-MV with CL replaced by MCL.

The Modified Confidence Level MCL is formally introduced as follows: consider a classifier that produces an r-dimensional vector (p1,…,pr), where pk is the probability the classifier assigns to class yk (pk ≥ 0 for all k = 1,…,r and \({\sum }_{k=1}^{r} p_{k}=1\)). Then, the class predicted by the classifier is

$$ y^{*} = y_{\ell} \quad\text{if}\quad \ell=\arg \max_{k=1,\ldots,r} p_{k} , \text{ with confidence level } \text{CL} = \max_{k=1,\ldots,r} p_{k} . $$

Definition 1

In this setting, the Modified Confidence Level (MCL) is defined by:

$$ \text{MCL}=\text{CL}+(1-\text{CL}) {{\varDelta}} $$
(10)

where \({{\varDelta }}=\text {CL} - \widetilde {\text {CL}}\) is the margin of confidence, with \(\widetilde {\text {CL}} = \max \limits _{k=1,\ldots ,r : y_{k}\ne y^{*}} p_{k}\) the second largest probability.

Justification: Note that CL + (1 −CL)Δ = (1 −Δ)CL + Δ. Formula (10) is obtained by searching for a weighted sum of CL and Δ with respective weights f(Δ) and 1, that is, MCL = f(Δ)CL + Δ, such that

  1. i)

    f is a linear function, that is, f(Δ) = aΔ + b,

  2. ii)

    If \(\text {CL}=\widetilde {\text {CL}} (\Rightarrow {{\varDelta }}=0)\), then MCL = CL,

  3. iii)

    If CL = 1(⇒Δ = 1), then MCL = 1,

  4. iv)

    CL ≤MCL ≤ 1.

Indeed, by i) and ii), we obtain that b = 1, and then, by using iii) we have that a = − 1, giving that f(Δ) = 1 −Δ. As by definition, \(0\le \widetilde {\text {CL}}\le \text {CL}\le 1\), we have that Δ ∈ [0,1] and then iv) holds. More specifically, MCL verifies:

$$0<\frac{1}{r}\le \text{CL} \le \text{MCL}\le 1$$

(see the Appendix for this and other properties of MCL).

Intuitively, we have introduced MCL as a modification of CL obtained by adding the non-negative term Δ multiplied by 1 −CL, and it can be interpreted as a degree of support, different from CL, for the prediction y∗ given by the classifier. The idea is that this new measure incorporates more information than CL about how reliable the prediction is. The definition is motivated by the fact that when more than one class has a high probability (that is, when \(\widetilde {\text {CL}}\) is close to CL), it seems reasonable to assume that we have less conviction when choosing the class with the highest probability; therefore, to associate a degree of support with the prediction, we reward predictions made with a wide margin between the first and second candidates, and penalize the others.
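The definition translates directly into code; the sketch below (the function name is ours) computes CL, the margin Δ and MCL from a probability vector, and reproduces the motivating 3-class example.

```r
# Modified Confidence Level of a probability vector p (p_k >= 0, sum(p) = 1)
mcl <- function(p) {
  s     <- sort(p, decreasing = TRUE)
  CL    <- s[1]          # confidence level: largest probability
  Delta <- CL - s[2]     # margin of confidence: CL minus the second largest probability
  CL + (1 - CL) * Delta  # formula (10)
}

mcl(c(0.6, 0.4, 0.0))  # 0.68: same CL = 0.6 but a narrow margin
mcl(c(0.2, 0.6, 0.2))  # 0.76: the second prediction gets the higher degree of support
```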

A similar situation has already been considered, for instance, in [17], where the second maximum of the conditional probability distribution obtained from a Property Performance Bayesian Network is taken into account in order to select candidates that allow one to find the Virtual Machine with minimal resource cost.

We can observe a parallelism between the MCL-MV combiner scheme and the uncertainty sampling method in the Active Learning setting (see [25]). Indeed, MCL-MV prefers the prediction of a base classifier which is less uncertain on how to label, uncertainty being measured through the margin of confidence. The higher the confidence margin with which the model predicts, the lower its uncertainty, then the more preferable the base classifier prediction will be.

Remark 3

Note that MCL coincides with CL if Δ = 0, and also that if MCL is a monotonically non-decreasing function of CL, then \(y^{*}_{\text {MCL-MV}}=y^{*}_{\text {CL-MV}}\). By Proposition 4 below, we then have that the CL-MV and MCL-MV schemes are equivalent combiners in the binary case, in the sense that they provide the same predictions.

Proposition 4

In binary classification (r = 2), MCL is a monotonically increasing function of CL.

Proof

Monotonicity in the binary case is a consequence of the fact that MCL, as a function of x = CL, is

$$f(x)=-2 x^{2}+4 x-1$$

which is an increasing function of x in the interval where CL lives, [0.5,1]. Indeed, if we denote CL by x, \(\widetilde {\text {CL}}\) is then 1 − x, and therefore

$$ \begin{array}{@{}rcl@{}} \text{MCL}&=\text{CL}+(1-\text{CL}) {{\varDelta}}=\text{CL}+(1-\text{CL}) (\text{CL}-\widetilde{\text{CL}})\\ &=x+(1-x) (x-(1-x))=-2 x^{2}+4 x-1, \end{array} $$

and \(f^{\prime }(x)=-4 (x-1)>0\) if x < 1, which means that f(x) is strictly increasing, while \(f^{\prime }(x)=0\) if x = 1, in which case CL = MCL = 1.
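A quick numerical check of this proof (our own snippet):

```r
f <- function(x) -2 * x^2 + 4 * x - 1   # MCL as a function of x = CL in the binary case
x <- seq(0.5, 1, by = 0.01)
all(diff(f(x)) > 0)   # TRUE: f is increasing on [0.5, 1]
range(f(x))           # 0.5 1: so CL <= MCL <= 1 on this interval
```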

And what happens in the multi-class context? Although in the toy example of Table 2 CL-MV gives the same prediction as MCL-MV, as can be seen in Table 4 below, this need not be the case in general.

Table 4 Enlargement of the first toy example in Table 2

Indeed, in this example \(y^{*}_{\text {CL-MV}} = y^{*}_{\text {MCL-MV}} = y_{2}\), so the two predictions match. However, this is not always so, as we can see in the second toy example of Table 5, in which only the probability distribution corresponding to classifier \({\mathcal C}_{4}\) changes with respect to the toy example in Table 4.

Table 5 Second toy example

In this second toy example, the predictions are:

$$ \begin{array}{@{}rcl@{}} y_{\text{Sum}}^{*}&=&y_{\text{Average}}^{*}=y_{\text{Product}}^{*}=y_{\text{Minimum}}^{*}=y_{3},\\ y_{\text{Majority}}^{*}&=&y^{*}_{\text{CL-MV}} = y_{1},\quad y^{*}_{\text{MCL-MV}} = y_{\text{Maximum}}^{*}=y_{2}, \end{array} $$

and we can observe that CL-MV and MCL-MV provide different predictions.

6 Experimentation

It is worth mentioning here that “there is no single best classifier and that classifiers applied to different problems and trained by different algorithms perform differently” ([19]). Therefore, given the imponderable sources of variation involved when comparing classifiers, we follow the advice of [19], Section 1.4: we carry out the experiments with multiple training and validation sets and with multiple runs, and we perform statistical hypothesis tests to compare Accuracy and MCC as performance metrics. More specifically, the experiments were carried out on different datasets to which we apply the bagging procedure, and we have validated and compared the obtained classifiers using the k-fold cross-validation procedure. Each experiment is repeated N = 10 times with different random seeds for the separation of each dataset into the k folds.

As decision trees are considered easy-to-implement state-of-the-art classifiers, their use is widespread, and since the crux of the methodology we introduce here does not lie in the type of base classifier used in the ensemble, we decided to bag decision trees (through the C4.5 algorithm, using the function J48 provided by the R library RWeka (Footnote 1)). Note that in order to evaluate the predictive capacity of the ensembles, it is desirable that the base classifiers from which they are built do not have a very high predictive power, which is achieved with decision trees.

First, we compare CL-MV and MCL-MV against the simple majority vote, and against each other. Secondly, we compare CL-MV and MCL-MV against the usual average, which behaves better than the majority vote but is computationally more demanding, and finally we compare them against the decision tree generated from the whole training dataset without bagging.

6.1 The datasets

Datasets of small to moderate size have been considered in the experiments of this section, without claiming to be exhaustive. All the datasets are public and have been obtained from freely accessible and well-established repositories, such as the UCI machine learning repository [13] and Kaggle Inc. (Footnote 2). See Table 6 for a brief list of the datasets and a summary of their main characteristics.

Table 6 Datasets used in the experimentation phase

6.2 Experimental design

The experiments have been implemented in the R language [22], with fixed seeds for reproducibility purposes. The details are as follows: to avoid possible biases, the procedure is repeated for N = 10 runs, with different seeds in the process of randomly splitting the dataset D into folds, and in every run the same strategy is used, which we break down into the following steps:

  • Step 1: preparation of the training/validation sets.

    For all the datasets we perform 10-fold cross-validation to evaluate the considered ensemble methods for bagging, as well as the decision tree algorithm C4.5 generated from the complete training dataset, denoted by DT in what follows. For that, we split the whole dataset D into 10 subsets or folds of approximately equal size, say V1,…,V10. For each ℓ = 1,…,10, we denote by Tℓ the complement of Vℓ in D.

  • Step 2: learning DT.

    For each ℓ = 1,…,10, we use fold Vℓ as validation set for the DT model built from the training set Tℓ, on which we learn the decision tree algorithm C4.5. Then, for each ℓ we obtain the corresponding confusion matrix.

  • Step 3: bagging.

    For each fold ℓ = 1,…,10, and for a fixed number of bags NBags (we have considered three possibilities: NBags = 5, 10, 100), we first draw from Tℓ NBags samples at random with replacement, each of the same size as Tℓ, and denote them by TB1,…,TBNBags. Then, for each j = 1,…,NBags, we learn the decision tree algorithm C4.5 with TBj as training dataset, and from these base classifiers we build the ensembles using the combiner schemes: simple majority vote, CL-MV, MCL-MV and average. Note that all of them have been built from the same base classifiers. The maximum, minimum and product combiners have also been considered, although they show a very poor behaviour: since the other four combiner schemes (majority vote, CL-MV, MCL-MV and average) clearly outperform them in all the settings with the two considered metrics (accuracy and MCC), we have not included the details, so as to lighten the section devoted to the results (Section 6.3).

Further step: Validation

For each run n = 1,…,N and each fold ℓ = 1,…,10, we validate the ensembles of classifiers used for bagging on the validation set Vℓ, obtaining the corresponding confusion matrices. From these matrices, and from the matrices obtained for the DT classifier, we compute the Accuracy and MCC metrics. In addition, for each model and for n = 1,…,N, we compute the averages over ℓ = 1,…,10 of Accuracy and MCC. So, finally, we have a vector of length N for each model, metric and dataset, which forms a statistical sample of size N that we use to carry out the comparison between models using the appropriate statistical hypothesis tests: pairwise Student’s t-tests or Wilcoxon signed-rank tests [28], their non-parametric counterpart, in case of lack of normality (after applying the Shapiro-Wilk test for normality [23] to decide whether we should apply a parametric or non-parametric methodology).
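The overall design of one run can be summarised by the following simplified R sketch, which bags J48 trees on R’s built-in iris dataset (not one of the datasets of Table 6) and reuses the ensemble_predict, accuracy and mcc helper sketches given earlier; it is an outline under these assumptions, not the exact code used in our experiments, and it requires the RWeka package (and a Java installation).

```r
library(RWeka)  # provides J48 (Weka's C4.5)

bagging_cv_predictions <- function(D, NBags = 10, K = 10, scheme = "cl_mv") {
  folds <- sample(rep(1:K, length.out = nrow(D)))   # random split into K folds
  out <- factor(rep(NA, nrow(D)), levels = levels(D$Species))
  for (l in 1:K) {
    Tl <- D[folds != l, ]                           # training set T_l
    Vl <- D[folds == l, ]                           # validation set V_l
    # NBags bootstrap samples of T_l, one C4.5 tree per sample
    trees <- lapply(1:NBags, function(b)
      J48(Species ~ ., data = Tl[sample(nrow(Tl), replace = TRUE), ]))
    # class-probability outputs of the NBags base classifiers on V_l
    probs <- lapply(trees, predict, newdata = Vl, type = "probability")
    pred_k <- sapply(seq_len(nrow(Vl)), function(i) {
      P <- do.call(rbind, lapply(probs, function(Q) Q[i, ]))  # M x r matrix for case i
      ensemble_predict(P, scheme)                             # combiner of Section 2
    })
    out[folds == l] <- levels(D$Species)[pred_k]
  }
  out
}

cv_pred <- bagging_cv_predictions(iris, NBags = 10)
C <- table(predicted = cv_pred, true = iris$Species)  # confusion matrix for this run
accuracy(C)
mcc(C)
```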

6.3 Results

The outcomes of the comparisons are shown in Tables 13 and 14 (in the Appendix), where only significant results have been recorded. As usual, the superscript ∗ denotes statistical significance at 5%, ∗∗ at 1% and ∗∗∗ at 1‰, and ⋅ denotes weak significance at 10%. The alternative hypothesis, which is accepted for small p-values, is the one reported in the table; for example, “MCL-MV > majority” indicates that the bagging procedure with the MCL-MV scheme is better than that with the simple majority vote, in the sense that the mean (or median, as appropriate) of the metric is statistically significantly greater (the p-value of the corresponding statistical test is < 0.1), the accompanying p-value being the measure of statistical significance.

To facilitate the interpretation of the results, a summary of Tables 13 and 14 is given in Tables 7, 8 and 9, where p-values are one-sided exact Binomial p-values, obtained as follows: for example, in Table 7 below, the comparison between MCL-MV and the majority vote with NBags = 100 shows 4 datasets in favor of the former and 1 in favor of the latter, that is, of the 5 datasets for which there are significant differences between MCL-MV and the majority vote, 4 are in favor of MCL-MV, resulting in a one-sided exact p-value of \(P(B(n=5, p=0.5)=4 )=\binom {5}{4}\times 0.5^{5}=0.15625 ,\) which would be the probability of obtaining such a result if there were really no differences between the two combiner schemes.
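The quoted p-value is a direct Binomial computation, e.g. in R:

```r
choose(5, 4) * 0.5^5              # 0.15625
dbinom(4, size = 5, prob = 0.5)   # the same value, as a Binomial(5, 0.5) probability
```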

Table 7 Comparison summary of the predictive power: MCL-MV and CL-MV vs. majority vote
Table 8 Comparison summary of the predictive power: MCL-MV and CL-MV vs. average
Table 9 Comparison summary of the predictive power: average vs. majority vote, and MCL-MV vs. CL-MV

We observe that the main hypothesis (that CL-MV and MCL-MV outperform the majority vote) is supported by the results in Table 7, while Table 9 supports the secondary hypothesis that in some cases MCL-MV shows better predictive power than CL-MV, and in others the opposite, and Tables 8 and 9 sustain the secondary hypothesis that although the average rule outperforms the majority vote, it does not outperform CL-MV or MCL-MV.

Note that the results corresponding to the comparison against DT are not recorded since, for most of the datasets considered in this work, bagging with any of the combiner schemes CL-MV, MCL-MV, majority vote or average is significantly better than DT, as expected.

Also note that we have used 5, 10 and 100 bootstrap replicates of the training datasets for bagging because it seemed reasonable to try a low, a middle and a high number. We observe that for most datasets, increasing this number improves the predictive behaviour of the classifiers obtained by bagging (see Table 16 in the Appendix, where we record the average over the runs of the averages over the folds of both the Accuracy and MCC metrics), and also that the differences between the schemes decrease.

Finally, to address the issue of computational complexity, measured by the amount of time required to run the algorithm, we have recorded the running times of the R code for all performed computations corresponding to bagging with the different combiner schemes (see Table 15 in Appendix B, whose information is summarized in Tables 10, 11 and 12 below where, as for Tables 7-9, p-values are one-sided exact Binomial p-values). For that, we use the functions tic and toc of the library tictoc (Footnote 3) to measure the run time of a chunk of code by taking the difference between the time at the start and at the end (elapsed times).
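Timing with tictoc follows the usual tic/toc pattern; a minimal sketch (the timed expression is just a placeholder):

```r
library(tictoc)
tic("bagging with CL-MV")
Sys.sleep(0.1)   # placeholder for the bagging-and-combining computation being timed
toc()            # prints the elapsed (wall-clock) time for the chunk above
```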

Table 10 Comparison summary of the running times: average vs. majority vote
Table 11 Comparison summary of the running times: MCL-MV and CL-MV vs. average
Table 12 Comparison summary of the running times: MCL-MV and CL-MV vs. majority

The results of these tables indicate that, in terms of running times, bagging with the simple majority vote is less computationally demanding than with CL-MV or MCL-MV, which in turn are less computationally demanding than with the average combiner scheme, confirming the remaining part of the secondary hypotheses.

7 Discussion

“No free lunch” theorems (see [29]) tell us that, for any algorithm, high performance in one context is compensated by low performance in another. Applied to ensembles of probabilistic classifiers, different combiner schemes are known to perform differently on different datasets, and therefore choosing an appropriate combiner scheme in each particular case is a major problem. To address this question, it is essential to understand the way in which classifiers are combined into ensembles.

In this paper we have used a common theoretical framework for ensembling multi-class probabilistic classifiers (see (1)), of which the most popular combiner schemes are particular cases. We have focused on class-conscious combiner schemes, for which the function gk is defined from the probabilities assigned by the different base classifiers to class k, {pjk, j = 1,…,M}, and/or the predictions of these classifiers, \(\{y_{j}^{*}, j=1,\ldots ,M\}\). Outside our scope are the class-indifferent combiners, such as the decision templates (see [19]), which use the set of all the probabilities {pjℓ, j = 1,…,M, ℓ = 1,…,r} (named there the “decision profile” DP) to define gk. A difference between the former and the latter is that the class-conscious combiners are idempotent, that is, applied to M copies of the same classifier they give the same decision as that classifier; this is not the case for decision templates, which can give a different decision. This characteristic of decision templates, whose predictive capacity may be better or worse than that of a class-conscious combiner, is difficult to justify from the point of view of intuition and applications: in a team whose members fully agree and make the same decision, how can we explain that the team as a whole makes a decision different from that of its members?

As many empirical studies have shown that the simpler class-conscious combiner schemes often work remarkably well ([18]), we concentrate on this category and only consider non-trainable rules, which do not need a separate training algorithm to find weights for the base classifiers (usually a function of their estimated accuracies). We distinguish between hard and soft voting. The majority vote is a hard-voting scheme (based on the label outputs of the classifiers), perhaps the oldest and simplest strategy for decision making in a group, and still the most common today.

Among the soft-voting combiner schemes, which poll the continuous outputs of the base classifiers to reach a decision, we consider the most popular choices: average, product, minimum and maximum. The minimum is the most pessimistic choice: a class is supported by the ensemble with a certain degree only if all the members of the ensemble give a support of at least that degree to the class, while at the other extreme, the maximum rule is the most optimistic, since a support degree is assigned by the ensemble to a class if at least one of the members supports that class with this degree; the average is an intermediate case between pessimism and optimism, and in general it is preferred. Besides, the product and the minimum combiners are oversensitive to probabilities close to zero: the presence of such a probability for a given class acts as a veto on that particular class, regardless of how large the probabilities assigned by the other classifiers to that class might be, which in general argues against their use.

The introduction in this paper of a novel class-conscious non-trainable combiner scheme named CL-MV (and its counterpart MCL-MV), halfway between the majority vote and the average schemes and defined as a weighted version of the former with the confidence levels as weights, seemed natural and convenient, since non-trainable rules are simpler, well behaved and have lower computing needs. Indeed, despite its simple definition, this combiner achieves greater accuracy than the majority vote in the binary case and, at the same time, in plausible scenarios it is more resilient to probability estimation errors than both the average and the product schemes; hence it is preferable when the probabilities assigned by the classifiers to the classes are not computed exactly but suffer from estimation errors.

Furthermore, comparing these combiner schemes through experimentation with bagging, we show that both CL-MV and MCL-MV are competitive with the average while being less computationally demanding, and that they outperform the rest of the considered combiners, including the majority vote, so they are a good alternative to consider when choosing a combiner scheme.

8 Conclusions

The majority vote is an elementary combiner scheme that, together with the average, is still the most used in practice today to ensemble probabilistic classifiers. In this work we have introduced a non-trainable weighted version of the simple majority vote combiner scheme, CL-MV, which uses as weight the confidence level that each base classifier gives to its prediction (instead of using weights based on the accuracies of the base classifiers, which corresponds to a trainable weighted majority vote scheme), and can be thought of as a semi-hard voting combiner scheme, halfway between the majority vote (hard voting) and the average (soft voting) combiners. From a theoretical point of view, we have proved the following results, which could be a plausible explanation of the good performance observed in the experimentation phase:

  • In the binary case, CL-MV is more accurate than the majority vote scheme.

  • In the multi-class setting, under reasonable hypotheses, both the average rule and CL-MV are more resilient to estimation errors than the product rule, with CL-MV even more resilient than the average rule.

That is, CL-MV is a semi-hard voting rule that improves the accuracy of the hard voting and the resilience to estimation errors of the considered soft voting combiner schemes.

We also introduce another simple measure of the degree of support that each base classifier gives to its prediction, alternative to CL in the multi-class setting, which we name MCL (for Modified Confidence Level). It embodies more information than the usual CL, since it incorporates additional knowledge of the probability distribution over the classes: more specifically, MCL is based on the difference between the maximum and the second maximum of the probability distribution.

The results of our experiments, using fifteen datasets from the UCI machine learning repository, are very encouraging since with the meta-algorithm bagging, we have given heuristic support to the hypotheses that we had raised at the beginning of the work, reaching the following statements, with both accuracy and MCC as performance metrics:

  • MCL-MV and CL-MV give similar performance results, and each of them is better for some of the databases used in our experimental work.

  • Both MCL-MV and CL-MV improve the classifying power of the simple majority vote.

  • With both MCL-MV and CL-MV, bagging outperforms the base classifiers (each a single decision tree algorithm C4.5).

  • Bagging with the average rule outperforms bagging with the simple majority vote, but it is equivalent to bagging with MCL-MV or CL-MV.

Moreover, we also evaluate the computational complexity of the algorithms, reaching the conclusion that time complexity is higher for bagging using the average, and lower for the majority vote, positioning MCL-MV and CL-MV combiners at an intermediate point.

Therefore, we can say from the experimental evidence that both CL-MV and MCL-MV are combiner schemes preferable for bagging to the simple majority vote (the usual choice) and also to the average (its natural competitor), a conclusion that is in line with the theoretical results proved in this paper.