1 Introduction

1.1 Ensemble of probabilistic classifiers

Classification is one of the main tasks of Supervised Learning. Given a dataset containing, for each object, the values of a set of attributes that describe it and a categorical output variable (the class to which the object belongs), with r ≥ 2 possible classes y1,…,yr, a classifier is an algorithm that infers the class of a new object or instance from its known attributes. Different Machine Learning methodologies are used to learn classifiers from a dataset of solved cases in which both the attributes and the class are recorded for each instance (up to missing data). Among them, we are interested in probabilistic classifiers, which not only predict the class but also estimate the probability distribution over the set of classes; the predicted class is the one with the highest associated probability, that is, the most likely class for the instance, following the maximum a posteriori (MAP) criterion.

Probabilistic classifiers provide a prediction that is useful in its own right, and particularly so when classifiers are combined to create ensembles. Ensembles of classifiers (also known as “combining classifiers”, see [18]) are a technique that has been widely applied in classification learning: the idea is to build a new classifier by combining a set of base classifiers, in the hope of improving on their behaviour, as different works effectively show (see [5, 11, 16, 18, 27]). An important research topic in this field is that of combination schemes: how to compare them and choose between them, as well as which type of base classifier to use to build the ensemble. For example, in [20] the authors show experimentally that ensembles of Naive Bayes classifiers following different schemes are significantly better than the standard Naive Bayes, and also slightly better than an ensemble of Naive Bayes and decision tree classifiers.

1.2 Bagging

As single decision classifiers can suffer from high variance, a simple way to address this flaw is to use them within randomization-based ensemble methods, introducing random perturbations into the learning procedure so as to produce several different classifiers from a single training set. This is the case of bagging, an acronym for Bootstrap AGGregatING, introduced by Breiman in [6]. Bagging is a meta-classifier that reduces the variance of a base classifier by injecting randomization into its construction: it fits base classifiers on learning sets obtained as bootstrap samples of the original training set (random subsets drawn with replacement) and then aggregates their individual predictions into an ensemble by means of the majority vote combiner scheme. With this procedure we lose the interpretability of a single simple classifier, but potentially gain predictive power.

In many cases, bagging is a simple way to improve the predictive power of a single model without changing the underlying classifier. The improvement is obtained for unstable underlying classifiers, as is the case of decision trees (see [6]), which are the base classifiers we will use in our experimental phase. Paraphrasing Breiman ([6]): “The evidence [...] is that bagging can push a good but unstable procedure a significant step towards optimality. On the other hand, it can slightly degrade the performance of stable procedures.”

Random forests, which are outstanding examples of probabilistic classifiers, were also introduced by Breiman in [7] as a variant of bagging in which the input attributes are randomized when considering candidates to split internal nodes (see also [26]). Ensembling classifiers is also the main idea behind boosting (see for instance [14] for a general description of boosting and [10] for an application in the field of medicine). A comparison of the effectiveness of randomization, bagging and boosting for improving the performance of the decision tree algorithm C4.5 can be found in [12].

1.3 Combiner schemes

In general, we build M ≥ 2 different base probabilistic classifiers, \({\mathcal C}_{1},\ldots , {\mathcal C}_{M}\), which may or may not come from the bagging meta-algorithm, and then combine their outputs to construct an ensemble. The final decision of the ensemble is derived using a combination rule, which falls into one of the following two groups (see for instance [30]):

  1. i)

    Hard voting: combination rules that operate on class labels, such as the simple majority vote, which goes with the prediction that appears most often among the base classifiers, and is the one used by bagging. In the binary setting, the majority vote criterion coincides with the classical Condorcet criterion, according to which the winning class must win its one-on-one matches against all other classes, that is, it must be preferred to each other class when compared to them one at a time.

  2. ii)

    Soft voting: combination rules obtained by polling the continuous outputs of the base classifiers through a function (average, maximum, minimum, product,... see [18, 19]) and returning the class label that maximizes the value of that function applied to the predicted probabilities; the average is the strongest of them from the viewpoint of predictive power. The average combiner is a natural competitor of the majority vote for bagging, and showed similar results in the experimental phase of [6] (Section 6.1). Our experimental evidence is different, however, as we will see in Section 6.3, where we find that the average scheme outperforms the majority vote, although at the price of a greater computational cost.

The combiner scheme that we introduce in this work lies halfway between the majority vote and the average (we will see later in what sense), so it can be considered a semi-hard voting scheme.

Another possible grouping of the combination rules is trainable vs. non-trainable. The simple majority vote and the average are non-trainable, but their usual weighted counterparts are trainable, since the weights are determined through a separate training algorithm, usually as a function of the estimated accuracies of the base classifiers. Using non-trainable rules provides simplicity and is less demanding in memory and computation.

The combiner scheme that we propose is non-trainable, although it is a weighted version of the simple majority vote, because its weights are obtained simply as a measure of the degree of support that each base classifier assigns to the class it predicts. As this measure we propose the confidence level (CL), that is, the probability that each base classifier assigns to its own prediction. It is also natural to consider an alternative measure, the modified confidence level (MCL), which takes into account additional knowledge of the probability distribution over the classes. The corresponding non-trainable weighted versions of the simple majority vote scheme are named in what follows CL-MV and MCL-MV, for Confidence Level (respectively, Modified Confidence Level) based Majority Vote.

From a heuristic perspective, our contribution consists of supporting the following hypotheses through experimentation with different datasets:

  • Main hypothesis: using CL-MV (or MCL-MV) instead of simple majority vote, the obtained scheme significantly improves the classifying power of bagging.

  • Secondary hypotheses:

    • Although there is no clear significant difference between them, in some cases the MCL-MV scheme gives better results than CL-MV (and in some others, the opposite).

    • Although bagging with the average rule outperforms that with the simple majority vote, neither CL-MV nor MCL-MV is outperformed by the average: they hold up, showing no statistically significant differences with it, while being computationally less demanding.

And from a theoretical perspective, our main contribution is to obtain two important results on the combiner schemes comparison:

  • In the binary case, we prove that if the number of base classifiers is odd, the accuracy of CL-MV is greater than that of the majority vote, provided the probability that each base classifier gives the correct label is large enough (Section 3).

  • In the general multi-class setting, we perform a sensitivity analysis (Section 4) showing that, under some reasonable hypotheses, CL-MV is more resilient to probability estimation errors than both the average and the product rules.

1.4 Measures of performance comparison

To compare the predictive power of the different combiner schemes, we will perform a series of experiments using real datasets following the bagging procedure. We denote by C = (Cij)i,j= 1,…,r a general confusion matrix, with Cij being the number of instances in the testing dataset that belong to class j and have been assigned to class i by the classifier, and we compare the goodness of the ensembles as classifiers on unseen data in the validation process using two different performance measures that can be applied both in the binary and in the multi-class setting:

  • Accuracy: this performance measure is one of the most intuitive and appealing, and is defined from C in this way:

$$\text{Accuracy}=\frac{{\sum}_{i=1}^{r} C_{ii}}{{\sum}_{i=1}^{r} {\sum}_{j=1}^{r} C_{ij}} .$$

Accuracy ranges between 0 and 1, the latter corresponding to perfect classification.

  • Matthews Correlation Coefficient (MCC): a more subtle performance measure, first introduced in the binary case by B.W. Matthews [21] as a measure of association obtained by discretization of Pearson’s correlation coefficient for two binary vectors, and generalized in [15] to multi-class classification. MCC has proven to be more reliable as a classification metric than Cohen’s Kappa, which has long been used in many works (see [9]). Its definition is as follows:

$$ \text{MCC}=\frac{\sum\limits_{k,\ell,m=1}^{r} (C_{kk} C_{\ell m}-C_{mk} C_{k\ell})} {\sqrt{\sum\limits_{k=1}^{r} \left( \left( \sum\limits_{\ell=1}^{r} C_{k\ell}\right)\left( \sum\limits_{u,v=1, u\ne k}^{r} C_{uv}\right)\right)} \sqrt{\sum\limits_{k=1}^{r} \left( \left( \sum\limits_{\ell=1}^{r} C_{\ell k}\right)\left( \sum\limits_{u,v=1, u\ne k}^{r} C_{vu}\right)\right)}} $$

MCC also assumes its theoretical maximum value of 1 when classification is perfect, but ranges between − 1 and 1.

In both cases, the larger the metric value, the better the classifier performance.
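As an illustration of how these two metrics are computed from a confusion matrix laid out as above (rows indexed by the predicted class, columns by the true class), the following R sketch may help; the function names are ours, and the MCC is computed with the standard multi-class form, which is algebraically equivalent to the displayed expression.

```r
# Accuracy and multi-class MCC from a confusion matrix C,
# with C[i, j] = number of instances of true class j predicted as class i.
accuracy <- function(C) sum(diag(C)) / sum(C)

mcc <- function(C) {
  s  <- sum(C)        # total number of instances
  c0 <- sum(diag(C))  # correctly classified instances
  pk <- rowSums(C)    # instances predicted as class k
  tk <- colSums(C)    # instances truly belonging to class k
  num <- c0 * s - sum(pk * tk)
  den <- sqrt(s^2 - sum(pk^2)) * sqrt(s^2 - sum(tk^2))
  if (den == 0) return(0)  # degenerate case (a single predicted or true class)
  num / den
}

# Toy 3-class confusion matrix
C <- matrix(c(50, 3, 2,
               4, 45, 6,
               1, 2, 37), nrow = 3, byrow = TRUE)
accuracy(C)
mcc(C)
```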

1.5 Description of the sections

The remainder of the paper is structured as follows. After introducing the CL-MV combiner scheme in Section 2, we compare the accuracies of CL-MV and the majority vote in the binary case in Section 3, and in Section 4 we investigate the sensitivity of the average, product and CL-MV schemes to probability estimation errors in the general multi-class setting. The Modified Confidence Level (MCL) is introduced in Section 5 as an alternative to the confidence level CL as degree of support in classification, and some of its properties are shown in the Appendix. Section 6 describes the datasets used, the experimental design aimed at comparing the predictive power and the computational complexity of the considered combiner schemes for bagging, and the results obtained (some complementary tables are collected in the Appendix). We finish with a discussion in Section 7 and some concluding words in Section 8.

2 The CL-MV combiner scheme

We build a novel ensemble classifier from M base classifiers \({\mathcal C}_{1}, \ldots , {\mathcal C}_{M}\) by introducing a modification of the simple majority vote scheme, which we name the Confidence Level based Majority Vote, CL-MV. This combiner scheme uses the classifications given by the base classifiers together with their corresponding estimated probability distributions, as follows:

If pjk denotes the probability that classifier j assigns to class k, then the class predicted by classifier j following the maximum a posteriori (MAP) criterion is the one with the largest assigned probability. That is, the predicted class of the j-th probabilistic classifier is

$$ y^{*}_{j}=y_{\ell} \quad\text{if}\quad \ell=\arg \max_{k=1,\ldots,r} p_{jk} ,\quad \text{and}\quad \text{CL}_{j}= \max_{k=1,\ldots,r} p_{jk} $$

is said to be the confidence level of the prediction, which is interpreted as a degree of support for it. (It is understood that if the maximum is attained at more than one class, one of them is chosen by a tie-breaking rule.) In general, a combiner scheme predicts in the following way:

$$ y^{*}_{ensemble}=y_{\ell}\quad\text{with}\quad \ell= \arg \max_{k=1,\ldots,r} g_{k} $$
(1)

for some function gk. Consider the following particular cases:

  1. 1.

    Majority vote: \(g_{k}={\sum }_{j=1}^{M} d_{jk}\), where \(d_{jk}=\begin {cases} 1 &\text { if } y_{j}^{*}=y_{k}\\ 0 & \text { otherwise.} \end {cases}\)

    Weighted majority vote: \(g_{k}={\sum }_{j=1}^{M} \omega _{j} d_{jk}\), where \(d_{jk}\) is as in the simple majority vote, and ωj is the weight assigned to classifier \({\mathcal C}_{j}\), usually obtained from its estimated accuracy (trainable combiner).

  2. 2.

    Average: \(g_{k}=\frac {1}{M} {\sum }_{j=1}^{M} p_{jk}\) (the Sum combiner is equivalent, with the same gk but without dividing by M).

  3. 3.

    Product: \(g_{k}={\prod }_{j=1}^{M} p_{jk}\)

  4. 4.

    Minimum: \(g_{k}=\min \limits _{j=1,\ldots ,M} p_{jk}\)

  5. 5.

    Maximum: \(g_{k}=\max \limits _{j=1,\ldots ,M} p_{jk}\)

We propose the following non-trainable weighted version of the majority vote combiner scheme, based on the confidence level CL as its degree of support:

  1. 6.

    CL-MV: \(g_{k}={\sum }_{j=1}^{M} \omega _{j} d_{jk}\), with djk as in the majority vote and ωj = CLj.

Note that ωj needs no separate training algorithm to be learned, which makes CL-MV a non-trainable combiner scheme, and that for each classifier it only uses the maximum of its probability distribution, unlike the average, which uses all the values of the probability distribution. Also note that, by definition of djk, gk only depends on the weight ωj for those j such that djk = 1, that is, such that \(y_{j}^{*}=y_{k}\), so we can assume without loss of generality that otherwise ωj = 0. Accordingly, we introduce the notation \(y(k)=\{j=1,\ldots ,M : y_{j}^{*}=y_{k}\}\) for any k = 1,…,r, with which we can rewrite:

  1. 1.

    Majority vote: gk = #y(k).

    Weighted majority vote: \(g_{k}={\sum }_{j\in y(k)} \omega _{j}\) where ωj is the weight assigned to classifier \({\mathcal C}_{j}\) by a separate learning algorithm.

  2. 6.

    CL-MV: \(g_{k}={\sum }_{j\in y(k)} \omega _{j}\) where ωj = CLj.

(Here and in the sequel, we use the convention that a sum over an empty set is zero.)

Remark 1

In the binary case (r = 2), we will see in Proposition 1 below that under certain circumstances CL-MV matches the average scheme.

Proposition 1

In binary classification (r = 2), suppose that \(y_{\text {CL-MV}}^{*}=y_{k}\). Then, if #y(k) ≤ M/2, we can ensure that \(y_{\text{Average}}^{*}=y_{k}\), that is, the CL-MV and the average schemes give the same prediction.

Before giving the proof, let us look at the two examples in Table 1, with M = 5 classifiers and r = 2 classes, where the probabilities pjk are listed: in example a) both the hypothesis and the conclusion of Proposition 1 hold, while in example b) neither does (it is a counterexample showing that without the hypothesis, the conclusion may fail).

Table 1 Illustrative examples of Proposition 1

Indeed, example a) in Table 1 shows that \(y_{Average}^{*}=y_{2}=y_{\text {CL-MV}}^{*}\) and #y(2) = 2 ≤ 5/2, while example b) gives \(y_{Average}^{*}=y_{2}\) and \(y_{\text {CL-MV}}^{*}=y_{1}\), which is possible since #y(1) = 3 > 5/2.

Proof of Proposition 1

Without loss of generality we can assume that \(y_{\text {CL-MV}}^{*}=y_{1}\). This means that

$$ {\sum}_{j \in y(1)} \text{CL}_{j} \ge {\sum}_{j\in y(2)} \text{CL}_{j} $$
(2)

Moreover, we assume that #y(1) ≤ M/2, which implies that #y(1) ≤ #y(2), since M = #y(1) + #y(2). Then, writing \({\sum }_{j \in y(1)} \left( 1-\text{CL}_{j} \right)=\#y(1)-{\sum }_{j \in y(1)} \text{CL}_{j}\) and using (2) together with #y(1) ≤ #y(2), we obtain that

$$ {\sum}_{j \in y(1)} \left( 1-\text{CL}_{j} \right) \le {\sum}_{j\in y(2)} \left( 1-\text{CL}_{j}\right) $$
(3)

and then, adding term to term (2) and (3), we obtain

$$ {\sum}_{j \in y(1)} \text{CL}_{j} + {\sum}_{j\in y(2)} \left( 1-\text{CL}_{j}\right) \ge {\sum}_{j\in y(2)} \text{CL}_{j} + {\sum}_{j \in y(1)} \left( 1-\text{CL}_{j} \right) , $$

that is, \(\sum \limits _{j=1}^{M} p_{j1} \ge \sum \limits _{j=1}^{M} p_{j2}\), which implies that \(y_{Average}^{*}=y_{1}\).

(If CL-MV classifies without ties, that is, \({\sum }_{j\in y(1)} \text {CL}_{j} > {\sum }_{j\in y(2)} \text {CL}_{j}\), then \(\sum \limits _{j=1}^{M} p_{j1} > \sum \limits _{j=1}^{M} p_{j2}\), so the average scheme also classifies without ties.)
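Proposition 1 can also be checked numerically. The following R sketch (our own helper code, not taken from the paper) draws random binary probability outputs for M = 5 base classifiers and verifies that, whenever the CL-MV winner is backed by at most M/2 of the classifiers, the average rule gives the same prediction.

```r
set.seed(1)
M <- 5
check_prop1 <- function() {
  p1 <- runif(M)                   # probability assigned to class 1 by each classifier
  P  <- cbind(p1, 1 - p1)          # M x 2 matrix of predicted probabilities
  pred <- apply(P, 1, which.max)   # individual MAP predictions
  CL   <- apply(P, 1, max)         # confidence levels
  g_cl  <- sapply(1:2, function(k) sum(CL[pred == k]))  # CL-MV scores
  k_cl  <- which.max(g_cl)         # CL-MV prediction
  k_avg <- which.max(colMeans(P))  # average-rule prediction
  # the implication of Proposition 1: #y(k_cl) <= M/2  =>  same prediction
  if (sum(pred == k_cl) <= M / 2) k_cl == k_avg else TRUE
}
all(replicate(10000, check_prop1()))  # expected: TRUE
```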

In words: with the simple majority vote, the combiner chooses the class with the most votes among the base classifiers (for each classifier, the counter of each class adds 1 if the classifier predicts that class, and 0 otherwise, and the combiner chooses the class with the highest counter); with the CL-MV combiner scheme, instead, the counter of each class adds the degree of support (CLj) if classifier \({\mathcal C}_{j}\) predicts that class, and 0 otherwise. Consider the toy example in Table 2 below, with M = 5 classifiers and r = 3 classes, which shows the probability distributions provided by each classifier. The corresponding predictions of the base classifiers and of the non-trainable ensembles are also given.

Table 2 In boldface: the CL in each probability distribution, as well as the maximum of the sum, the average, the product, the minimum and the maximum of the predicted probabilities, and the maximum of the sum of weights for both the majority vote and the CL-MV combiners

As can be seen in Table 2, the predictions with the ensembles are:

\(y_{\text {Sum}}^{*}=y_{\text {Average}}^{*}=y_{\text {Product}}^{*}=y_{\text {Minimum}}^{*}=y_{3},\)

\(y_{\text {Majority}}^{*}=y_{1}, y^{*}_{\text {CL-MV}} = y_{\text {Maximum}}^{*}=y_{2}\).

The main highlights of the CL-MV combiner scheme are:

  1. 1.

    It fuses not the predictions themselves but the degrees of support given to the predictions, where the degree of support is defined, as the confidence level, from the probability distribution over the classes assigned by each classifier.

  2. 2.

    It is non-trainable, that is, no extra parameters need to be trained and it is ready to run as soon as the base classifiers are available.

  3. 3.

    It can provide different predictions both from those provided by the majority vote and by the average criteria (see the toy example in Table 2).

  4. 4.

    In the binary case, we found a scenario where it gives the same prediction as the average (Proposition 1), and we will show that its accuracy is greater than that of the majority vote (see Section 3 below).

  5. 5.

    From the point of view of the sensitivity to probability estimation errors, under some reasonable hypotheses, CL-MV is more resilient than both the product and the average schemes (see Section 4 below).

The pseudo-code for the classification algorithm corresponding to the ensemble given by the novel CL-MV combiner scheme is Algorithm 1 below, and it has the benefit of being simple and easy to implement. For the sake of completeness, the pseudo-code for the standard majority vote and average schemes is also included as Algorithm 2 and Algorithm 3, respectively. The pseudo-code for the product, minimum and maximum combiner schemes is analogous to Algorithm 3, simply modifying the definition of gk in line 4 accordingly, and is therefore omitted.

[Algorithm 1: the CL-MV combiner scheme]
[Algorithm 2: the majority vote combiner scheme]
[Algorithm 3: the average combiner scheme]
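As the pseudo-code figures are not reproduced here, the following R sketch illustrates the three combiners within the common framework (1). It is an illustrative implementation written for this text (function and argument names are ours), not necessarily identical line by line to Algorithms 1-3.

```r
# P: M x r matrix, with P[j, k] the probability that base classifier j assigns to class k.
ensemble_predict <- function(P, scheme = c("cl_mv", "majority", "average")) {
  scheme <- match.arg(scheme)
  r <- ncol(P)
  pred <- apply(P, 1, which.max)   # MAP prediction of each base classifier
  CL   <- apply(P, 1, max)         # confidence level of each base classifier
  g <- switch(scheme,
    majority = sapply(1:r, function(k) sum(pred == k)),      # g_k = #y(k)
    cl_mv    = sapply(1:r, function(k) sum(CL[pred == k])),  # g_k = sum of CL_j over y(k)
    average  = colMeans(P)                                   # g_k = (1/M) sum_j p_jk
  )
  which.max(g)  # index of the predicted class (ties broken by the first maximum)
}

# Example with M = 5 classifiers and r = 3 classes
set.seed(123)
P <- matrix(runif(15), nrow = 5)
P <- P / rowSums(P)               # normalize each row into a probability distribution
ensemble_predict(P, "cl_mv")
ensemble_predict(P, "majority")
ensemble_predict(P, "average")
```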

Remark 2

As we verify empirically in Section 6.3, from the point of view of computation and storage, the majority vote (hard voting) has the advantage that once we know the prediction of each base classifier, \(y_{j}^{*}\), we do not need to store any other information about the probability distributions of the predictions. At the other extreme, the average scheme (soft voting) needs to store and use all the values of the distributions, making it more demanding in computation and storage. The CL-MV combiner is halfway between them, using only the maximum of each distribution. That is why we say that CL-MV is a semi-hard voting combiner scheme.

3 Accuracy in the binary case

The majority vote is one of the most popular combiner schemes, if not the most popular. In Section 4.2 of [19], to find out why this is so, the author studies its accuracy in depth, assuming that

  1. i)

    the number of classes is r = 2 (binary case),

  2. ii)

    the number of classifiers M is odd, say M = 2L + 1 with L ≥ 1,

  3. iii)

    the base classifiers outputs are independent (this condition may seem unrealistic, but for many applications it holds, at least approximately),

  4. iv)

    the probability that each base classifier gives the correct class label is p ∈ (0,1).

The majority vote will predict the correct class label if a simple majority of the base classifiers vote for it, that is, if at least ⌊M/2⌋ + 1 of them give the correct prediction (where ⌊x⌋ denotes the integer part or floor of x). Therefore, the accuracy of the ensemble based on the majority vote combiner is:

$$ Acc_{majority}={\sum}_{\ell = \lfloor{M/2}\rfloor+1}^{M} \binom{M}{\ell} p^{\ell} (1-p)^{M-\ell} $$
(4)

Then, it can be seen (Condorcet Jury Theorem, 1785 [24]) that

  1. a)

    if \(p>0.5, \lim _{M\to \infty } Acc_{majority}=1\) and it is monotonically increasing,

  2. b)

    if p = 0.5, Accmajority = 0.5 for all M,

  3. c)

    if \(p<0.5, \lim _{M\to \infty } Acc_{majority}=0\) and it is monotonically decreasing.

This result supports the intuition that we can expect an improvement over the individual accuracy p only if p > 0.5.

In the same scenario, what is the formula for the accuracy of the ensemble based on CL-MV? This combiner scheme gives the correct prediction if the number of classifiers that predict correctly is at least α, with α being the minimum integer such that

$$p \alpha > (1-p) (M-\alpha),$$

which is equivalent to saying that α is the minimum integer such that α > (1 − p)M. Then α = ⌊(1 − p)M⌋ + 1, and consequently the accuracy is:

$$ Acc_{\text{CL-MV}}=\sum\limits_{\ell = \lfloor{(1-p) M}\rfloor+1}^{M} \binom{M}{\ell} p^{\ell} (1-p)^{M-\ell} $$
(5)

We can easily prove the next result:

Proposition 2

In the binary case and with an odd number of base classifiers M, we have that

$$ \begin{cases} \text{If } p>\frac{M+1}{2 M}, &\text{ then } Acc_{\text{CL-MV}}>Acc_{majority},\\ \text{If } \frac{M-1}{2 M} \le p \le \frac{M+1}{2 M}, &\text{ then } Acc_{\text{CL-MV}}=Acc_{majority},\\ \text{If } p<\frac{M-1}{2 M}, &\text{ then } Acc_{\text{CL-MV}}<Acc_{majority}. \end{cases} $$

In particular, since (M + 1)/(2M) > 1/2, Proposition 2 implies that if p > (M + 1)/(2M), CL-MV strictly improves accuracy with respect to the majority vote combiner scheme, and by the Condorcet Jury Theorem, AccCL-MV is monotonically increasing and

$$\lim_{M\to \infty} Acc_{\text{CL-MV}} = 1 .$$

Proof

By (4) and (5) we see that AccCL-MV > Accmajority if and only if

$$\lfloor{M/2}\rfloor\ge \lfloor{(1-p) M}\rfloor+1\Leftrightarrow \lfloor{(1-p) M}\rfloor \le L-1 \Leftrightarrow (1-p) M < L \Leftrightarrow p>\frac{M+1}{2 M}.$$

On the other hand, AccCL-MV < Accmajority if and only if

$$\lfloor{M/2}\rfloor+1\le \lfloor{(1-p) M}\rfloor\Leftrightarrow (1-p) M \ge L+1 \Leftrightarrow p<\frac{M-1}{2 M}.$$

Otherwise, the accuracies are equal.
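Both accuracies (4) and (5) are upper tails of a Binomial(M, p) distribution, so they can be evaluated directly with pbinom; the short sketch below (our own, with illustrative values) checks Proposition 2 numerically for M = 11 and a value of p above the threshold (M + 1)/(2M) = 6/11.

```r
acc_majority <- function(M, p) 1 - pbinom(floor(M / 2), M, p)        # formula (4)
acc_clmv     <- function(M, p) 1 - pbinom(floor((1 - p) * M), M, p)  # formula (5)

M <- 11
p <- 0.60                 # p > (M + 1) / (2 * M) = 6/11
acc_majority(M, p)        # approx. 0.753
acc_clmv(M, p)            # approx. 0.901, strictly larger as stated in Proposition 2
```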

4 Error sensitivity

In this section we investigate the sensitivity of the product, the average (equivalently, the sum) and the CL-MV combiner schemes to probability estimation errors, by following the approach of [18].

Suppose now that the probabilities pjk, for j = 1,…,M and k = 1,…,r, are not computed exactly, but suffer from an estimation error. Denote by \(\hat p_{jk}\) the obtained estimates of the probabilities, which are the values used by the combination rules to obtain their predictions. As the additive error model is the most popular error model in statistics, we assume that the estimation error is additive, that is, for any j and k,

$$\widehat p_{jk} = p_{jk} + \varepsilon_{jk} ,$$

where the errors εjk are small (in absolute value). In particular, we assume that the errors do not affect the individual prediction of any base classifier (\(y_{j}^{*}\) is not affected by errors). In other words, we assume that for any j = 1,…,M,

$$ \arg \max_{k=1,\ldots,r} p_{jk} = \arg \max_{k=1,\ldots,r} \widehat p_{jk} . $$
(6)

This implies that y(k) is also unaffected by the probability estimation errors.

We are concerned about the effect that those probability estimation errors will have on the predictions obtained by the ensembles following the different combination rules. By (1), now \(y^{*}_{ensemble}=y_{\ell }\) with \(\ell = \arg \max \limits _{k=1,\ldots ,r} \widehat g_{k}\), where \(\widehat g_{k}\) denotes the estimate of function gk obtained substituting probabilities pjk by their estimates \(\widehat p_{jk}\). In what follows, we concentrate on the product, the average and the CL-MV combiner schemes.

  1. a.

    The product scheme: \(g_{k}=\prod \limits _{j=1}^{M} p_{jk}\).

    Following [18], we can write

    $$ \begin{array}{@{}rcl@{}} \widehat g_{k} &=& {\prod}_{j=1}^{M} \widehat p_{jk} = {\prod}_{j=1}^{M} (p_{jk}+\varepsilon_{jk})=\left( {\prod}_{j=1}^{M} p_{jk}\right) {\prod}_{i=1}^{M} \left( 1+\frac{\varepsilon_{ik}}{p_{ik}}\right)\\ & \approx& \left( {\prod}_{j=1}^{M} p_{jk}\right) \left( 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}}\right) = g_{k} \left( 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}}\right), \end{array} $$

    where we have made a linear approximation (we neglect higher order terms since the errors εjk are small and therefore, the product of two or more of them is of a very small order). That is, if the probabilities pjk are affected by the additive estimation errors εjk, then gk is affected by a multiplicative error \({{\varPsi }}_{k}^{prod}\), where

    $$ {{\varPsi}}_{k}^{prod} = 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}}. $$
  2. b.

    The average scheme: \(g_{k}=\frac {1}{M} \sum \limits _{j=1}^{M} p_{jk}\).

    Following [18] again,

    $$ \begin{array}{@{}rcl@{}} \widehat g_{k} &=& \frac{1}{M}{\sum}_{j=1}^{M} \widehat p_{jk} = \frac{1}{M}{\sum}_{j=1}^{M} (p_{jk}+\varepsilon_{jk})=\left( \frac{1}{M} {\sum}_{j=1}^{M} p_{jk}\right) \left( 1+\frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}\right)\\ &=& g_{k} \left( 1+\frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}\right). \end{array} $$

    In this case, if the probabilities pjk are affected by the additive estimation errors εjk, then gk is affected by a multiplicative error \({{\varPsi }}_{k}^{aver}\), where

    $$ {{\varPsi}}_{k}^{aver} = 1+\frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}. $$
  3. c.

    The CL-MV scheme: \(g_{k}=\sum \limits _{j\in y(k)} \text {CL}_{j}\) (assume that k is such that y(k) ≠ ∅), where \(\text{CL}_{j}=\max \limits _{\ell =1,\ldots ,r} p_{j\ell }\).

    Therefore,

    $$ \widehat g_{k} =\sum\limits_{j\in y(k)} \widehat{\text{CL}}_{j}=\sum\limits_{j\in y(k)} \max\limits_{\ell=1,\ldots,r} \widehat p_{j\ell}=\sum\limits_{j\in y(k)} \max\limits_{\ell=1,\ldots,r} \left( p_{j\ell} + \varepsilon_{j\ell}\right). $$
    (7)

    We make the following assumption:

    $$\text{(H$_{1}$)}\qquad \text{for all } k=1,\ldots,r,\quad k=\arg \max_{\ell=1,\ldots,r} \varepsilon_{j\ell}\quad \text{for all } j\in y(k),$$

    that is, \(\varepsilon _{jk}=\max \limits _{\ell =1,\ldots ,r} \varepsilon _{j\ell }\). In words, for each classifier the estimation error is largest at the class it predicts. Under (H1) we have that for any j ∈ y(k),

    $$ \max\limits_{\ell=1,\ldots,r} \left( p_{j\ell} + \varepsilon_{j\ell}\right)=\text{CL}_{j}+\varepsilon_{jk}.$$

    Then, substituting into (7), we have that under (H1),

    $$ \begin{array}{@{}rcl@{}} \widehat g_{k} &=&\sum\limits_{j\in y(k)} \left( \text{CL}_{j}+\varepsilon_{jk}\right)= \left( \sum\limits_{j\in y(k)} \text{CL}_{j}\right) \left( 1+\frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} \text{CL}_{i}}\right) \\ &= &g_{k} \left( 1+\frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} \text{CL}_{i}}\right). \end{array} $$

    Then, if the probabilities pjk are affected by the additive estimation errors εjk, function gk is affected by a multiplicative error \({{\varPsi }}_{k}^{\text {CL-MV}}\), with

    $$ {{\varPsi}}_{k}^{\text{CL-MV}} = 1+\frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} \text{CL}_{i}} = 1+\frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} p_{ik}}. $$
    (8)
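With the three multiplicative error factors at hand, they can be computed directly from the true probabilities and the errors; the following R sketch (our own notation) does so for a fixed class k, and the small example satisfies the hypotheses of Proposition 3 below, so the factors appear in the expected order.

```r
# P:   M x r matrix of true probabilities p_jk
# Eps: M x r matrix of estimation errors, so that Phat = P + Eps
error_factors <- function(P, Eps, k) {
  pred <- apply(P, 1, which.max)  # individual predictions, unaffected by errors by (6)
  yk   <- which(pred == k)        # the set y(k)
  c(prod    = 1 + sum(Eps[, k] / P[, k]),
    average = 1 + sum(Eps[, k]) / sum(P[, k]),
    cl_mv   = if (length(yk) > 0) 1 + sum(Eps[yk, k]) / sum(P[yk, k]) else NA)
}

# Small illustration with M = 3 classifiers and r = 2 classes
P   <- rbind(c(0.7, 0.3), c(0.6, 0.4), c(0.4, 0.6))
Eps <- rbind(c(0.01, -0.01), c(0.01, -0.01), c(0.05, -0.05))
error_factors(P, Eps, k = 1)  # cl_mv <= average <= prod, as in Proposition 3
```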

Now we can compare the error factors to see which combiner scheme is more resilient to probability estimation errors, and prove the following result (Proposition 3 below). First, we introduce a hypothesis:

$$ \text{(H$_{2}$)}\qquad \text{for all } k=1,\ldots,r,\quad {\sum}_{j\in y(k)} \varepsilon_{jk} {\sum}_{j\notin y(k)} \widehat p_{jk} \le {\sum}_{j\notin y(k)} \varepsilon_{jk} {\sum}_{j\in y(k)} \widehat p_{jk}. $$

The rationale for this assumption is that, for fixed k, class k is the most probable class for the classifiers \({\mathcal C}_{j}\) with j ∈ y(k), and then \({\sum }_{j\in y(k)} \widehat p_{jk}\) is likely to be sufficiently larger than \({\sum }_{j\notin y(k)} \widehat p_{jk}\) to compensate for the fact that \({\sum }_{j\notin y(k)} \varepsilon _{jk}\) could be smaller than \({\sum }_{j\in y(k)} \varepsilon _{jk}\), since the errors are assumed to be small. This is precisely the situation that we will see, as an illustration, for the example in Table 2 (see Table 3 below). Proposition 3 states that the average rule is much less affected by probability estimation errors than the product rule and that, under reasonable conditions, the CL-MV rule is even less affected by errors than the average.

Table 3 Predicted probabilities \(\widehat p_{jk}\) and estimation errors εjk in brackets, for the example in Table 2, where ε > 0 is small, and δ ∈ (0,1)

Proposition 3

With the previous notations, for any k = 1,…,r,

  1. a)

    If \({\sum }_{i=1}^{M} \widehat p_{ik}\ge 1+{\sum }_{i=1}^{M} \varepsilon _{ik}\), then \( {{\varPsi }}_{k}^{aver} \le {{\varPsi }}_{k}^{prod}\).

  2. b)

    If \({\sum }_{i\in y(k)} \widehat p_{ik}\ge 1 + {\sum }_{i \in y(k)} \varepsilon _{ik}\), then \( {{\varPsi }}_{k}^{\text {CL-MV}} \le {{\varPsi }}_{k}^{prod}\).

  3. c)

    Under (H2), \({{\varPsi }}_{k}^{\text {CL-MV}} \le {{\varPsi }}_{k}^{aver}\).

Proof

The proof of statement a) can be found in Section 6 of [18], but we reproduce it here for the sake of completeness. Indeed, as pik ≤ 1, each error εik is amplified in \({{\varPsi }}_{k}^{prod}\) by dividing it by pik, while for the average the errors are not amplified, in such a way that

$${{\varPsi}}_{k}^{prod} = 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}} \ge 1+{\sum}_{i=1}^{M} \varepsilon_{ik} \!\ge\! 1+\frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}={{\varPsi}}_{k}^{aver},$$

where the second inequality is due to the fact that we are assuming that \({\sum }_{i=1}^{M} \widehat p_{ik}\ge 1+{\sum }_{i=1}^{M} \varepsilon _{ik}\), which is equivalent to \({\sum }_{i=1}^{M} p_{ik}\ge 1\). This assumption is likely to happen for the most probable class(es).

The proof of b) is similar for k such that \({\sum }_{i\in y(k)} \widehat p_{ik}\ge 1 + {\sum }_{i \in y(k)} \varepsilon _{ik}\), which is equivalent to \({\sum }_{i\in y(k)} p_{ik}\ge 1\):

$$ {{\varPsi}}_{k}^{prod} = 1+{\sum}_{i=1}^{M} \frac{\varepsilon_{ik}}{p_{ik}} \ge 1+{\sum}_{i\in y(k)} \varepsilon_{ik} \ge 1+\frac{{\sum}_{i\in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} p_{ik}}={{\varPsi}}_{k}^{\text{CL-MV}}. $$

To see that statement c) holds, we only have to prove that for any k = 1,…,r verifying (H2),

$$ \frac{{\sum}_{i \in y(k)} \varepsilon_{ik}}{{\sum}_{i\in y(k)} p_{ik}} \le \frac{{\sum}_{i=1}^{M} \varepsilon_{ik}}{{\sum}_{i=1}^{M} p_{ik}}. $$
(9)

Indeed, this holds since

$$ \begin{array}{@{}rcl@{}} (9) &\Leftrightarrow& {\sum}_{i \in y(k)} \varepsilon_{ik} \left( {\sum}_{i \in y(k)} p_{ik} + {\sum}_{i \notin y(k)} p_{ik}\right)\\ &&\le {\sum}_{i \in y(k)} p_{ik} \left( {\sum}_{i \in y(k)} \varepsilon_{ik} + {\sum}_{i \notin y(k)} \varepsilon_{ik}\right)\\ &\Leftrightarrow& {\sum}_{i \in y(k)} \varepsilon_{ik} {\sum}_{i \notin y(k)} p_{ik} \le {\sum}_{i \in y(k)} p_{ik} {\sum}_{i \notin y(k)} \varepsilon_{ik}\\ & \!\Leftrightarrow\!& \!{\sum}_{i \in y(k)} \varepsilon_{ik} {\sum}_{i \notin y(k)} \left( \widehat p_{ik} - \varepsilon_{ik}\right)\!\le\! {\sum}_{i \in y(k)} \left( \widehat p_{ik} - \varepsilon_{ik}\right)\! {\sum}_{i \notin y(k)} \varepsilon_{ik}\\ &\Leftrightarrow& {\sum}_{i \in y(k)} \varepsilon_{ik} {\sum}_{i \notin y(k)} \widehat p_{ik} \le {\sum}_{i \in y(k)} \widehat p_{ik} {\sum}_{i \notin y(k)} \varepsilon_{ik}, \end{array} $$

which is exactly (H2).

To illustrate the hypotheses made to establish Proposition 3, we return to the toy example of Table 2, now displayed with estimation errors in Table 3 below.

Observe that in Table 3 some of the estimation errors are negative. Indeed, since the true probabilities satisfy \({\sum }_{k=1}^{r} p_{jk}=1\) for any j = 1,…,M, and we also assume that the same holds for the predicted probabilities, \({\sum }_{k=1}^{r} \widehat p_{jk}=1\), necessarily the sum of the estimation errors of any classifier must be zero, that is, \({\sum }_{k=1}^{r} \varepsilon _{jk}=0\). In other words, the sum of the estimation errors along each row of Table 3 must be 0. Although we will not give the details, it can be checked that for δ < 1 large enough and ε > 0 small enough, for instance,

$$\delta > 0.20 \quad \text{and}\quad \varepsilon< 0.08 ,$$

all assumptions are met:

  1. 0)

    The basic assumption that \(p_{jk}=\widehat p_{jk} -\varepsilon _{jk}\in [0,1]\).

  2. i)

    Assumption (6).

  3. ii)

    \({\sum }_{i=1}^{M} \widehat p_{ik}\ge 1+{\sum }_{i=1}^{M} \varepsilon _{ik}\) for any k = 1,2,3.

  4. iii)

    \({\sum }_{i\in y(k)} \widehat p_{ik}\ge 1 + {\sum }_{i \in y(k)} \varepsilon _{ik}\) for k = 1,2 (y(3) = ∅, so the condition does not make sense for k = 3).

  5. iv)

    Hypotheses (H1) and (H2).

5 The modified confidence level (MCL)

Although CL quantifies the uncertainty associated with the class prediction and is the usual measure of degree of support, it suffers from a shortcoming. The information provided by CL is very valuable, but it can be insufficient to compare predictions made by different classifiers in the multi-class setting. A toy example suffices as an argument. Imagine that we are in the 3-class setting and that, for two classifiers, the probability distributions associated with their predictions are, respectively:

$$(0.6, 0.4, 0.0)\quad\text{and}\quad (0.2, 0.6, 0.2) .$$

For the first classifier the predicted class will be y1, with a confidence level of CL = 0.6, while the second classifier will predict y2, with the same confidence level. Can we choose objectively between them? The prediction of the second classifier seems more “reliable” (in the intuitive sense of being more dependable). Then, if we had to choose, intuitively we would choose y2 as class prediction, that is, we would prefer the classification provided by the second classifier.

To formalize this intuition, since CL has proven unable to distinguish between the two predictions in the example, we introduce a modification that can, which we name the Modified Confidence Level (MCL). MCL can then be used as an alternative to CL to compare and choose among predictions made by different classifiers, and therefore to construct a combiner scheme that we name MCL-MV, which is the already introduced CL-MV with CL replaced by MCL.

The Modified Confidence Level MCL is formally introduced as follows: consider a classifier that produces an r-dimensional vector (p1,…,pr), where pk is the probability the classifier assigns to class yk (pk ≥ 0 for all k = 1,…,r and \({\sum }_{k=1}^{r} p_{k}=1\)). Then, the class predicted by the classifier is

$$ y^{*} = y_{\ell} \quad\text{if}\quad \ell=\arg \max_{k=1,\ldots,r} p_{k} , \text{ with confidence level } \text{CL} = \max_{k=1,\ldots,r} p_{k} . $$

Definition 1

In this setting, the Modified Confidence Level (MCL) is defined by:

$$ \text{MCL}=\text{CL}+(1-\text{CL}) {{\varDelta}} $$
(10)

where \({{\varDelta }}=\text {CL} - \widetilde {\text {CL}}\) is the margin of confidence, with \(\widetilde {\text {CL}} = \max \limits _{k=1,\ldots ,r : y_{k}\ne y^{*}} p_{k}\) the second largest probability.

Justification: Note that CL + (1 −CL)Δ = (1 −Δ)CL + Δ. Formula (10) is obtained by searching for a weighted sum of CL and Δ with respective weights f(Δ) and 1, that is, MCL = f(Δ)CL + Δ, such that

  1. i)

    f is a linear function, that is, f(Δ) = aΔ + b,

  2. ii)

    If \(\text {CL}=\widetilde {\text {CL}} (\Rightarrow {{\varDelta }}=0)\), then MCL = CL,

  3. iii)

    If CL = 1(⇒Δ = 1), then MCL = 1,

  4. iv)

    CL ≤MCL ≤ 1.

Indeed, by i) and ii), we obtain that b = 1, and then, by using iii) we have that a = − 1, giving that f(Δ) = 1 −Δ. As by definition, \(0\le \widetilde {\text {CL}}\le \text {CL}\le 1\), we have that Δ ∈ [0,1] and then iv) holds. More specifically, MCL verifies:

$$0<\frac{1}{r}\le \text{CL} \le \text{MCL}\le 1$$

(see the Appendix for this and other properties of MCL).

Intuitively, we have introduced MCL as a modification of CL obtained by adding the non-negative term Δ multiplied by 1 −CL, and it can be interpreted as a degree of support, different from CL, for the prediction y∗ given by the classifier. The idea is that this new measure incorporates more information than CL about how reliable the prediction is. The definition is motivated by the fact that when more than one class has a high probability (that is, when \(\widetilde {\text {CL}}\) is close to CL), it seems reasonable to assume that we have less conviction when choosing the class with the highest probability; therefore, to associate a degree of support with the prediction, we reward predictions made with a wide margin between the first and second candidates, and penalize the others.
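The definition translates directly into code; the sketch below (the function name is ours) computes CL, the margin Δ and MCL from a probability vector, and reproduces the motivating 3-class example.

```r
# Modified Confidence Level of a probability vector p (p_k >= 0, sum(p) = 1)
mcl <- function(p) {
  s     <- sort(p, decreasing = TRUE)
  CL    <- s[1]          # confidence level: largest probability
  Delta <- CL - s[2]     # margin of confidence: CL minus the second largest probability
  CL + (1 - CL) * Delta  # formula (10)
}

mcl(c(0.6, 0.4, 0.0))  # 0.68: same CL = 0.6 but a narrow margin
mcl(c(0.2, 0.6, 0.2))  # 0.76: the second prediction gets the higher degree of support
```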

A similar situation has already been considered, for instance, in [17], where the second maximum of the conditional probability distribution obtained from a Property Performance Bayesian Network is taken into account in order to select candidates that allow one to find the Virtual Machine with minimal resource cost.

We can observe a parallelism between the MCL-MV combiner scheme and the uncertainty sampling method in the Active Learning setting (see [25]). Indeed, MCL-MV prefers the prediction of a base classifier which is less uncertain on how to label, uncertainty being measured through the margin of confidence. The higher the confidence margin with which the model predicts, the lower its uncertainty, then the more preferable the base classifier prediction will be.

Remark 3

Note that MCL coincides with CL if Δ = 0, and also that if MCL is a monotonically non-decreasing function of CL, then \(y^{*}_{\text {MCL-MV}}=y^{*}_{\text {CL-MV}}\). By Proposition 4 below, we then have that the CL-MV and MCL-MV schemes are equivalent combiners in the binary case, in the sense that they provide the same predictions.

Proposition 4

In binary classification (r = 2), MCL is a monotonically increasing function of CL.

Proof

Monotonicity in the binary case is a consequence of the fact that MCL, as a function of x = CL, is

$$f(x)=-2 x^{2}+4 x-1$$

which is an increasing function of x in the interval where CL lives, [0.5,1]. Indeed, if we denote CL by x, \(\widetilde {\text {CL}}\) is then 1 − x, and therefore

$$ \begin{array}{@{}rcl@{}} \text{MCL}&=\text{CL}+(1-\text{CL}) {{\varDelta}}=\text{CL}+(1-\text{CL}) (\text{CL}-\widetilde{\text{CL}})\\ &=x+(1-x) (x-(1-x))=-2 x^{2}+4 x-1, \end{array} $$

and \(f^{\prime }(x)=-4 (x-1)>0\) if x < 1, which means that f(x) is strictly increasing, while \(f^{\prime }(x)=0\) if x = 1, in which case CL = MCL = 1.
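A quick numerical check of this proof (our own snippet):

```r
f <- function(x) -2 * x^2 + 4 * x - 1   # MCL as a function of x = CL in the binary case
x <- seq(0.5, 1, by = 0.01)
all(diff(f(x)) > 0)   # TRUE: f is increasing on [0.5, 1]
range(f(x))           # 0.5 1: so CL <= MCL <= 1 on this interval
```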

And what happens in the multi-class context? Although in the toy example of Table 2 CL-MV gives the same prediction as MCL-MV, as can be seen in Table 4 below, this need not be the case in general.

Table 4 Enlargement of the first toy example in Table 2

Indeed, in this example \(y^{*}_{\text {CL-MV}} = y^{*}_{\text {MCL-MV}} = y_{2}\), so the two predictions match. However, this is not always so, as we can see in the second toy example of Table 5, in which only the probability distribution corresponding to classifier \({\mathcal C}_{4}\) changes with respect to the toy example in Table 4.

Table 5 Second toy example

In this second toy example, the predictions are:

$$ \begin{array}{@{}rcl@{}} y_{\text{Sum}}^{*}&=&y_{\text{Average}}^{*}=y_{\text{Product}}^{*}=y_{\text{Minimum}}^{*}=y_{3},\\ y_{\text{Majority}}^{*}&=&y^{*}_{\text{CL-MV}} = y_{1},\quad y^{*}_{\text{MCL-MV}} = y_{\text{Maximum}}^{*}=y_{2}, \end{array} $$

and we can observe that CL-MV and MCL-MV provide different predictions.

6 Experimentation

It is worth mentioning here that “there is no single best classifier and that classifiers applied to different problems and trained by different algorithms perform differently” ([19]). Therefore, given the imponderable sources of variation involved when comparing classifiers, we follow the advice of [19], Section 1.4: we carry out the experiments with multiple training and validation sets and with multiple runs, and we perform statistical hypothesis tests to compare Accuracy and MCC as performance metrics. More specifically, the experiments were carried out on different datasets to which we apply the bagging procedure, and we have validated and compared the obtained classifiers using the k-fold cross-validation procedure. Each experiment is repeated N = 10 times with different random seeds for the separation of each dataset into the k folds.

As decision trees are considered easy-to-implement state-of-the-art classifiers, their use is widespread, and since the crux of the methodology we introduce here does not lie in the type of base classifier used in the ensemble, we decided to bag decision trees (through the C4.5 algorithm, using the function J48 provided by the R library RWeka (Footnote 1)). Note that in order to evaluate the predictive capacity of the ensembles, it is desirable that the base classifiers from which they are built do not have a very high predictive power, which is achieved with decision trees.

First, we compare CL-MV and MCL-MV against the simple majority vote, and against each other. Secondly, we compare CL-MV and MCL-MV against the usual average, which behaves better than the majority vote but is computationally more demanding, and finally we compare them against the decision tree generated from the whole training dataset without bagging.

6.1 The datasets

Datasets of small to moderate size have been considered in the experiments of this section, without claiming to be exhaustive. All the datasets are public and have been obtained from freely accessible and well-established repositories, such as the UCI machine learning repository [13] and Kaggle Inc. (Footnote 2). See Table 6 for a brief list of the datasets and a summary of their main characteristics.

Table 6 Datasets used in the experimentation phase

6.2 Experimental design

The experiments have been implemented in the R language [22], with fixed seeds for reproducibility purposes. The details are as follows: to avoid possible biases, the procedure is repeated for N = 10 runs, with different seeds in the process of randomly splitting the dataset D into folds, and in every run the same strategy is used, which we break down into the following steps:

  • Step 1: preparation of the training/validation sets.

    For all the datasets we perform 10-fold cross-validation to evaluate the considered ensemble methods for bagging, as well as the decision tree algorithm C4.5 generated from the complete training dataset, denoted by DT in what follows. For that, we split the whole dataset D into 10 subsets or folds of approximately equal size, say V1,…,V10. For each ℓ = 1,…,10, we denote by Tℓ the complement of Vℓ in D.

  • Step 2: learning DT.

    For each ℓ = 1,…,10, we use fold Vℓ as validation set for the DT model built from the training set Tℓ, on which we learn the decision tree algorithm C4.5. Then, for each ℓ we obtain the corresponding confusion matrix.

  • Step 3: bagging.

    For each fold ℓ = 1,…,10, and for a fixed number of bags NBags (we have considered three possibilities: NBags = 5, 10, 100), we first draw from Tℓ NBags samples at random with replacement, each of the same size as Tℓ, and denote them by TB1,…,TBNBags. Then, for each j = 1,…,NBags, we learn the decision tree algorithm C4.5 with TBj as training dataset, and from these base classifiers we build the ensembles using the combiner schemes: simple majority vote, CL-MV, MCL-MV and average. Note that all of them have been built from the same base classifiers. The maximum, minimum and product combiners have also been considered, although they show a very poor behaviour: since the other four combiner schemes (majority vote, CL-MV, MCL-MV and average) clearly outperform them in all the settings with the two considered metrics (accuracy and MCC), we have not included the details, so as to lighten the section devoted to the results (Section 6.3).

Further step: Validation

For each run n = 1,…,N and each fold ℓ = 1,…,10, we validate the ensembles of classifiers used for bagging on the validation set Vℓ, obtaining the corresponding confusion matrices. From these matrices, and from the matrices obtained for the DT classifier, we compute the Accuracy and MCC metrics. In addition, for each model and for n = 1,…,N, we compute the averages over ℓ = 1,…,10 of Accuracy and MCC. So, finally, we have a vector of length N for each model, metric and dataset, which forms a statistical sample of size N that we use to carry out the comparison between models using the appropriate statistical hypothesis tests: pairwise Student’s t-tests or Wilcoxon signed-rank tests [28], their non-parametric counterpart, in case of lack of normality (after applying the Shapiro-Wilk test for normality [23] to decide whether we should apply a parametric or non-parametric methodology).
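The overall design of one run can be summarised by the following simplified R sketch, which bags J48 trees on R’s built-in iris dataset (not one of the datasets of Table 6) and reuses the ensemble_predict, accuracy and mcc helper sketches given earlier; it is an outline under these assumptions, not the exact code used in our experiments, and it requires the RWeka package (and a Java installation).

```r
library(RWeka)  # provides J48 (Weka's C4.5)

bagging_cv_predictions <- function(D, NBags = 10, K = 10, scheme = "cl_mv") {
  folds <- sample(rep(1:K, length.out = nrow(D)))   # random split into K folds
  out <- factor(rep(NA, nrow(D)), levels = levels(D$Species))
  for (l in 1:K) {
    Tl <- D[folds != l, ]                           # training set T_l
    Vl <- D[folds == l, ]                           # validation set V_l
    # NBags bootstrap samples of T_l, one C4.5 tree per sample
    trees <- lapply(1:NBags, function(b)
      J48(Species ~ ., data = Tl[sample(nrow(Tl), replace = TRUE), ]))
    # class-probability outputs of the NBags base classifiers on V_l
    probs <- lapply(trees, predict, newdata = Vl, type = "probability")
    pred_k <- sapply(seq_len(nrow(Vl)), function(i) {
      P <- do.call(rbind, lapply(probs, function(Q) Q[i, ]))  # M x r matrix for case i
      ensemble_predict(P, scheme)                             # combiner of Section 2
    })
    out[folds == l] <- levels(D$Species)[pred_k]
  }
  out
}

cv_pred <- bagging_cv_predictions(iris, NBags = 10)
C <- table(predicted = cv_pred, true = iris$Species)  # confusion matrix for this run
accuracy(C)
mcc(C)
```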

6.3 Results

The outcomes of the comparisons are shown in Tables 13 and 14 (in the Appendix), where only significant results have been recorded. As usual, the superscript ∗ denotes statistical significance at 5%, ∗∗ at 1% and ∗∗∗ at 1‰, and ⋅ denotes weak significance at 10%. The alternative hypothesis, which is accepted for small p-values, is the one reported in the table; for example, “MCL-MV > majority” indicates that the bagging procedure with the MCL-MV scheme is better than that with the simple majority vote, in the sense that the mean (or median, as appropriate) of the metric is statistically significantly greater (the p-value of the corresponding statistical test is < 0.1), the accompanying p-value being the measure of statistical significance.

To facilitate the interpretation of the results, a summary of Tables 13 and 14 is given in Tables 7, 8 and 9, where p-values are one-sided exact Binomial p-values, obtained as follows: for example, in Table 7 below, the comparison between MCL-MV and the majority vote with NBags = 100 shows 4 datasets in favor of the former and 1 in favor of the latter, that is, of the 5 datasets for which there are significant differences between MCL-MV and the majority vote, 4 are in favor of MCL-MV, resulting in a one-sided exact p-value of \(P(B(n=5, p=0.5)=4 )=\binom {5}{4}\times 0.5^{5}=0.15625 ,\) which would be the probability of obtaining such a result if there were really no differences between the two combiner schemes.
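The quoted p-value is a direct Binomial computation, e.g. in R:

```r
choose(5, 4) * 0.5^5              # 0.15625
dbinom(4, size = 5, prob = 0.5)   # the same value, as a Binomial(5, 0.5) probability
```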

Table 7 Comparison summary of the predictive power: MCL-MV and CL-MV vs. majority vote
Table 8 Comparison summary of the predictive power: MCL-MV and CL-MV vs. average
Table 9 Comparison summary of the predictive power: average vs. majority vote, and MCL-MV vs. CL-MV

We observe that the main hypothesis (that CL-MV and MCL-MV outperform the majority vote) is supported by the results in Table 7, while Table 9 supports the secondary hypothesis that in some cases MCL-MV shows better predictive power than CL-MV, and in others the opposite, and Tables 8 and 9 sustain the secondary hypothesis that although the average rule outperforms the majority vote, it does not outperform CL-MV or MCL-MV.

Note that the results corresponding to the comparison against DT are not recorded since, for most of the datasets considered in this work, bagging with any of the combiner schemes CL-MV, MCL-MV, majority vote or average is significantly better than DT, as expected.

Also note that we have used 5, 10 and 100 bootstrap replicates of the training datasets for bagging because it seemed reasonable to try a low, a middle and a high number. We observe that for most datasets, increasing this number improves the predictive behaviour of the classifiers obtained by bagging (see Table 16 in the Appendix, where we record the average over the runs of the averages over the folds of both the Accuracy and MCC metrics), and also that the differences between the schemes decrease.

Finally, to address the issue of computational complexity, measured by the amount of time required to run the algorithm, we have recorded the running times of the R code for all performed computations corresponding to bagging with the different combiner schemes (see Table 15 in Appendix B, whose information is summarized in Tables 10, 11 and 12 below where, as for Tables 7-9, p-values are one-sided exact Binomial p-values). For that, we use the functions tic and toc of the library tictoc (Footnote 3) to measure the run time of a chunk of code by taking the difference between the time at the start and at the end (elapsed times).
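Timing with tictoc follows the usual tic/toc pattern; a minimal sketch (the timed expression is just a placeholder):

```r
library(tictoc)
tic("bagging with CL-MV")
Sys.sleep(0.1)   # placeholder for the bagging-and-combining computation being timed
toc()            # prints the elapsed (wall-clock) time for the chunk above
```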

Table 10 Comparison summary of the running times: average vs. majority vote
Table 11 Comparison summary of the running times: MCL-MV and CL-MV vs. average
Table 12 Comparison summary of the running times: MCL-MV and CL-MV vs. majority

The results of these tables indicate that, in terms of running times, bagging with the simple majority vote is less computationally demanding than with CL-MV or MCL-MV, which in turn are less computationally demanding than with the average combiner scheme, confirming the remaining part of the secondary hypotheses.

7 Discussion

“No free lunch” theorems (see [29]) tell us that, for any algorithm, high performance in one context is compensated by low performance in another. Applied to ensembles of probabilistic classifiers, different combiner schemes are known to perform differently on different datasets, and therefore choosing an appropriate combiner scheme in each particular case is a major problem. To address this question, it is essential to understand the way in which classifiers are combined into ensembles.

In this paper we have used a common theoretical framework for ensembling multi-class probabilistic classifiers (see (1)), of which the most popular combiner schemes are particular cases. We have focused on class-conscious combiner schemes, for which the function gk is defined from the probabilities assigned by the different base classifiers to class k, {pjk, j = 1,…,M}, and/or the predictions of these classifiers, \(\{y_{j}^{*}, j=1,\ldots ,M\}\). Outside our scope are the class-indifferent combiners, such as the decision templates (see [19]), which use the set of all the probabilities {pjℓ, j = 1,…,M, ℓ = 1,…,r} (named there the “decision profile” DP) to define gk. A difference between the former and the latter is that the class-conscious combiners are idempotent, that is, applied to M copies of the same classifier they give the same decision as that classifier; this is not the case for decision templates, which can give a different decision. This characteristic of decision templates, whose predictive capacity may be better or worse than that of a class-conscious combiner, is difficult to justify from the point of view of intuition and applications: in a team whose members fully agree and make the same decision, how can we explain that the team as a whole makes a decision different from that of its members?

As many empirical studies have shown that the simpler class-conscious combiner schemes often work remarkably well ([18]), we concentrate on this category and only consider non-trainable rules, which do not need a separate training algorithm to find weights for the base classifiers (usually a function of their estimated accuracies). We distinguish between hard and soft voting. The majority vote is a hard-voting scheme (based on the label outputs of the classifiers), perhaps the oldest and simplest strategy for decision making in a group, and still the most common today.

Among the soft-voting combiner schemes, which poll the continuous outputs of the base classifiers to reach a decision, we consider the most popular choices: average, product, minimum and maximum. The minimum is the most pessimistic choice: a class is supported by the ensemble with a certain degree only if all the members of the ensemble give a support of at least that degree to the class, while at the other extreme, the maximum rule is the most optimistic, since a support degree is assigned by the ensemble to a class if at least one of the members supports that class with this degree; the average is an intermediate case between pessimism and optimism, and in general it is preferred. Besides, the product and the minimum combiners are oversensitive to probabilities close to zero: the presence of such a probability for a given class acts as a veto on that particular class, regardless of how large the probabilities assigned by the other classifiers to that class might be, which in general argues against their use.

The introduction in this paper of a novel class-conscious non-trainable combiner scheme named CL-MV (and its counterpart MCL-MV), halfway between the majority vote and the average schemes and defined as a weighted version of the former with the confidence levels as weights, seemed natural and convenient, since non-trainable rules are simpler, well behaved and have lower computing needs. Indeed, despite its simple definition, this combiner achieves greater accuracy than the majority vote in the binary case and, at the same time, in plausible scenarios it is more resilient to probability estimation errors than both the average and the product schemes; hence it is preferable when the probabilities assigned by the classifiers to the classes are not computed exactly but suffer from estimation errors.

Furthermore, comparing these combiner schemes through experimentation with bagging, we show that both CL-MV and MCL-MV are competitive with the average while being less computationally demanding, and that they outperform the rest of the considered combiners, including the majority vote, so they are a good alternative to consider when choosing a combiner scheme.

8 Conclusions

The majority vote is an elementary combiner scheme that, together with the average, is still the most used in practice today to ensemble probabilistic classifiers. In this work we have introduced a non-trainable weighted version of the simple majority vote combiner scheme, CL-MV, which uses as weight the confidence level that each base classifier gives to its prediction (instead of using weights based on the accuracies of the base classifiers, which corresponds to a trainable weighted majority vote scheme), and can be thought of as a semi-hard voting combiner scheme, halfway between the majority vote (hard voting) and the average (soft voting) combiners. From a theoretical point of view, we have proved the following results, which could be a plausible explanation of the good performance observed in the experimentation phase:

  • In the binary case, CL-MV is more accurate than the majority vote scheme.

  • In the multi-class setting, under reasonable hypotheses, both the average rule and CL-MV are more resilient to estimation errors than the product rule, with CL-MV even more resilient than the average rule.

That is, CL-MV is a semi-hard voting rule that improves the accuracy of the hard voting and the resilience to estimation errors of the considered soft voting combiner schemes.

We also introduce another simple measure of the degree of support that each base classifier gives to its prediction, alternative to CL in the multi-class setting, which we name MCL (for Modified Confidence Level). It embodies more information than the usual CL, since it incorporates additional knowledge of the probability distribution over the classes: more specifically, MCL is based on the difference between the maximum and the second maximum of the probability distribution.

The results of our experiments, using fifteen datasets from the UCI machine learning repository, are very encouraging since with the meta-algorithm bagging, we have given heuristic support to the hypotheses that we had raised at the beginning of the work, reaching the following statements, with both accuracy and MCC as performance metrics:

  • MCL-MV and CL-MV give similar performance results, and each of them is better for some of the databases used in our experimental work.

  • Both MCL-MV and CL-MV improve the classifying power of the simple majority vote.

  • With both MCL-MV and CL-MV, bagging outperforms the base classifiers (each a single decision tree algorithm C4.5).

  • Bagging with the average rule outperforms bagging with the simple majority vote, but it is equivalent to bagging with MCL-MV or CL-MV.

Moreover, we also evaluate the computational complexity of the algorithms, reaching the conclusion that time complexity is higher for bagging using the average, and lower for the majority vote, positioning MCL-MV and CL-MV combiners at an intermediate point.

Therefore, we can say from the experimental evidence that both CL-MV and MCL-MV are combiner schemes preferable for bagging to the simple majority vote (the usual choice) and also to the average (its natural competitor), a conclusion that is in line with the theoretical results proved in this paper.