1 Introduction

The goal of supervised machine learning is to induce an accurate generalizing function \(\mathcal {F}: X \mapsto Y\) from a set of input feature vectors \(X =\{x_1, x_2, \ldots ,x_n\}\) and a corresponding set of label vectors \(Y =\{y_1, y_2, \ldots , y_n\}\). The quality of the function \(\mathcal {F}\) induced by a learning algorithm depends on the quality of the data used for training. However, many real-world data sets are inherently noisy, where the noise in a data set can be label noise and/or attribute noise. Label noise has been shown to be more detrimental than attribute noise (Zhu and Wu 2004) and is the focus of this paper. Noise can arise from various sources such as subjectivity, human error, and sensor malfunctions. Most learning algorithms are designed to tolerate a certain degree of noise by avoiding overfitting the training data. There are two general approaches for handling class noise: (1) creating learning algorithms that are robust to noise, such as the C4.5 algorithm for decision trees (Quinlan 1993), and (2) preprocessing the data prior to inducing a model, such as filtering (Wilson 1972; Brodley and Friedl 1999), weighting (Rebbapragada and Brodley 2007; Smith and Martinez 2014), or correcting (Teng 2003) noisy instances.

Previous work has generally examined filtering in a limited context, using a single learning algorithm or very few learning algorithms and/or a limited number of data sets. This may be due in part to the extra computational requirement of first filtering a data set and then inducing a model from the filtered data. As such, previous work was generally limited to relatively fast learning algorithms such as decision trees (John 1995) and nearest-neighbor algorithms (Tomek 1976; Wilson and Martinez 2000). In addition, filtering prior to using instance-based learning algorithms was motivated in part by the desire to reduce the number of stored instances and because instance-based learning algorithms are more sensitive to noise than other learning algorithms. Most previous work also added artificial noise to the data sets (making 5–50 % of the instances noisy) to show that filtering, weighting, or cleaning provides significant improvements on noisy data sets. In this work, we examine filtering misclassified instances and using a majority voting ensemble on a set of 54 data sets and 9 learning algorithms without adding artificial noise. Within the context of the benefits of filtering established by previous work, we examine the extent to which filtering affects the performance of a learning algorithm when no artificial noise is added to a data set. This avoids making assumptions about how the noise is generated, which may or may not be accurate, and shows the effect of filtering on the inherent noise in real-world data sets, which is not known beforehand.

The results provide insights into the robustness of a majority voting ensemble and into when to employ a misclassification filter. Using a larger number of data sets allows for more statistical confidence in the results than if only a small number of data sets were used. We find that, in general, a voting ensemble is robust to noise and, when trained on unfiltered data, achieves significantly higher classification accuracy than a single learning algorithm trained on filtered data. For filtering, we find that using an ensemble filter achieves significantly higher classification accuracy than using a single learning algorithm as a filter. On data sets with higher percentages of inherent noisy instances, however, using the ensemble filter achieves higher classification accuracy than a voting ensemble for some learning algorithms. Training a voting ensemble on filtered training data significantly decreases classification accuracy compared to training it on unfiltered training data, likely due to a reduction in the diversity of the induced models in the ensemble.

In the next section, we present previous works for handling noise in supervised classification problems. A mathematical motivation for filtering misclassified instances is presented in Sect. 3. We then present our experimental methodology in Sect. 4 followed by a presentation of the results in Sect. 5. In Sect. 6 we provide conclusions and directions for future work.

2 Related work

As many real-world data sets are inherently noisy, most learning algorithms are designed to tolerate a certain degree of noise. Typically, learning algorithms achieve some robustness to noise by making a trade-off between the complexity of the induced model and how well the induced function fits the training data, in order to prevent overfitting. Techniques to avoid overfitting include early stopping using a validation set, pruning (as in the C4.5 algorithm for decision trees, Quinlan 1993), and regularization by adding a complexity penalty to the loss function (Bishop and Nasrabadi 2006). Previous work has examined how class noise and attribute noise affect the performance of various learning algorithms (Zhu and Wu 2004; Nettleton et al. 2010) and found that class noise is generally more harmful than attribute noise and that noise in the training set is more harmful than noise in the test set. Further, some learning algorithms have been adapted specifically to better handle label noise. For example, noisy instances are problematic for boosting algorithms (Schapire 1990; Freund 1990) because more weight is placed on misclassified instances, which often include mislabeled and noisy instances. To address this, Servedio (2003) presented a boosting algorithm that does not place too much weight on any single training instance. For support vector machines, Collobert et al. (2006) use the ramp loss function to place a bound on the maximum penalty for an instance that lies on the wrong side of the margin. Lawrence and Schölkopf (2001) explicitly model the possibility that an instance is mislabeled using a generative model and then use expectation maximization to update the probability that an instance is mislabeled.

Preprocessing the data set is another approach that explicitly handles label noise. This can be done by removing noisy instances, weighting the instances, or correcting incorrect labels. All three approaches first attempt to identify which instances are noisy using various criteria. Filtering noisy instances has received much attention and has generally resulted in an increase in classification accuracy (Gamberger et al. 2000; Smith and Martinez 2011). One frequently used filtering technique removes any instance that is misclassified by a learning algorithm (Wilson 1972) or by a set of learning algorithms (Brodley and Friedl 1999). Verbaeten and Van Assche (2003) further pursued the idea of using an ensemble for filtering using ideas from boosting and bagging. Other approaches use learning algorithm heuristics to remove noisy instances. Segata et al. (2009), for example, remove instances that are too close to or on the wrong side of the decision surface generated by a support vector machine. Zeng and Martinez (2003), while training a neural network, remove instances that have a low probability of being labeled correctly, where the probability is calculated from the output of the neural network. Filtering has the potential downside of discarding useful instances. However, it is assumed that there are significantly more non-noisy instances than noisy ones and that throwing away a few correct instances along with the noisy ones will not have a negative impact on a large data set.

Weighting the instances in a training set has the benefit of not discarding any instances. Rebbapragada and Brodley (2007) weight the instances using expectation maximization to cluster instances that belong to a pair of classes. The probabilities between classes for each instance are compiled and used to weight the influence of each instance. Smith and Martinez (2014) examine weighting the instances based on their probability of being misclassified.

Similar to weighting the training instances, data cleaning does not discard any instances, but rather strives to correct the noise in the instances. As in filtering, the output from a learning algorithm has been used to clean the data. Automatic data enhancement (Zeng and Martinez 2001) uses the output from a neural network to correct the label for training instances that have a low probability of being correctly labeled. Polishing (Teng 2000, 2003) trains a learning algorithm (in this case a decision tree) to predict the value for each attribute (including the class). The predicted (i.e. corrected) attribute values for the instances that increase generalization accuracy on a validation set are used instead of the uncleaned attribute values.

We differ from the related work in that we do not add artificial noise to the data sets when we examine filtering. Thus, we avoid making any assumptions about the noise source and focus on the noise inherent in the data sets. We also examine the effects of filtering on a larger set of learning algorithms and data sets providing more significance to the generality of the results.

3 Modeling class noise in a discriminative model

Lawrence and Schölkopf (2001) proposed to model a data set probabilistically using a generative model that models the noise process. They assume that the joint distribution \(p(x,y,\hat{y})\) (where x is the set of input features, \(\hat{y}\) is the observed, possibly noisy, class label given in the training set, and y is the actual unknown class label) is factorized as \(p(\hat{y}|y)p(x|y)p(y)\) as shown in Fig. 1a. However, since modeling the prior distribution of the unobserved random variable y is not feasible, it is more practical to estimate the prior distribution \(p(\hat{y})\) under some assumptions about the class noise, as shown in Fig. 1b.

Fig. 1 Graphical model of the generative probabilistic model proposed by Lawrence and Schölkopf (2001)

Here, we follow the premise of Lawrence and Schölkopf by explicitly modeling the possibility that an instance is misclassified. Rather than using a generative model, though, we use a discriminative model since we are focusing on classification tasks and do not require the full joint distribution. Also, discriminative models have been shown to yield better performance on classification tasks (Ng and Jordan 2001). Using a discriminative model that accounts for class noise motivates our investigation of filtering and of using a majority voting ensemble.

Let T be a training set composed of instances \(\langle x_i, \hat{y}_i\rangle \) drawn i.i.d. from the underlying data distribution \(\mathcal {D}\). Each instance is composed of an input vector \(x_i\) with a corresponding possibly noisy label vector \(\hat{y}_i\). Given the training data T, a learning algorithm generally seeks to find the most probable hypothesis h that maps each \(x_i \mapsto \hat{y}_i\). For supervised classification problems, most learning algorithms maximize \(p(\hat{y}_i|x_i,h)\) for all instances in T. This is shown graphically in Fig. 2a where the probabilities are estimated using a discriminative approach such as a neural network or a decision tree to induce a hypothesis of the data. Using Bayes’ rule and decomposing T into its individual constituent instances, the maximum a posteriori hypothesis is:

$$\begin{aligned} \mathop {\hbox {argmax}}\limits _{h \in \mathcal {H}} p(h|T)&=\frac{p(T|h)p(h)}{p(T)} \nonumber \\&\propto \prod _i p(x_i,\hat{y}_i|h)p(h)\nonumber \\ \mathop {\hbox {argmax}}\limits _{h \in \mathcal {H}} p(h|T)&=\prod _i p(\hat{y}_i|x_i,h) p(x_i|h)p(h). \end{aligned}$$
(1)

In Eq. 1, the MAP hypothesis h is found by finding a global optimum in which all instances are included in the optimization problem. However, noisy instances are often detrimental to finding the global optimum since they are not representative of the true (and unknown) underlying data distribution \(\mathcal {D}\). The possibility of label noise is not explicitly modeled in this form, which completely ignores \(y_i\). Thus, label noise is generally handled by avoiding overfitting such that more probable, simpler hypotheses are preferred (p(h)). The possibility of label noise can be modeled explicitly by including the latent random variable \(y_i\) along with \(x_i\) and \(\hat{y}_i\). Thus, an instance is the triplet \(\langle x_i, \hat{y}_i, y_i\rangle \) and a supervised learning algorithm seeks to maximize \(p(\hat{y}_i|x_i,y_i,h)\), as modeled graphically in Fig. 2b. Using the model in Fig. 2b, the MAP hypothesis becomes:

$$\begin{aligned} \mathop {\hbox {argmax}}\limits _{h \in \mathcal {H}} p(h|T)&\propto \prod _i p(x_i,y_i,\hat{y}_i|h)p(h)\nonumber \\&=\prod _i p(\hat{y}_i|x_i,y_i,h)p(y_i|x_i,h) p(x_i|h)p(h). \end{aligned}$$
(2)

Equation 2 shows that for an instance \(x_i\), the probability of an observed class label (\(p(\hat{y}_i|x_i,y_i,h)\)) should be weighted by the probability of the actual class (\(p(y_i|x_i,h)\)).

Fig. 2 Graphical representation of a discriminative probabilistic model for a \(p(\hat{y}|x)p(x)\) and b \(p(\hat{y}|x,y)p(y|x)p(x)\)

What we are really interested in is the probability that \(y_i = \hat{y}_i\). Using a discriminative model h trained on T, we can calculate \(p(y_i|\hat{y}_i,x_i,h)\) as

$$\begin{aligned} p(y_i|\hat{y}_i,x_i,h)&= p(y_i|\hat{y}_i, h)p(\hat{y}_i|x_i,h)p(h). \end{aligned}$$

Since the quantity \(p(y_i|\hat{y}_i, h)\) is unknown, \(p(y_i|\hat{y}_i,x_i,h)\) can be approximated as \(p(\hat{y}_i|x_i,h)\) assuming that \(p(y_i|\hat{y}_i, h)\) is represented in h. In other words, the induced discriminative model is able to capture whether one class label is more likely than another given an observed, possibly noisy, label. Otherwise, all class labels are assumed to be equally likely given an observed label. Thus, \(p(y_i|\hat{y}_i,x_i,h)\) can be approximated by finding the class distribution for a given \(x_i\) from an induced discriminative model. That is, after training a learning algorithm on T, the class distribution for an instance \(x_i\) can be calculated from the output of the learning algorithm. As shown in Eq. 1, \(p(\hat{y}_i|x_i,h)\) arises naturally through a derivation of Bayes’ law. The quantity \(p(\hat{y}_i|x_i,h)\) is the likelihood of the observed label of an instance given a hypothesis h, which a learning algorithm tries to maximize for each instance. Further, the dependence on a specific h can be removed by summing over all possible hypotheses h in \(\mathcal {H}\) and multiplying each \(p(\hat{y}_i|x_i,h)\) by p(h):

$$\begin{aligned} p(y_i|\hat{y}_i,x_i)\approx p(\hat{y}_i|x_i) = \sum _{h\in \mathcal {H}} p(\hat{y}_i|x_i,h)p(h). \end{aligned}$$
(3)

This formulation is infeasible, though, because (1) it is not practical (or possible) to sum over the set of all hypotheses, (2) calculating p(h) is non-trivial, and (3) not all learning algorithms produce a probability distribution. These limitations make probabilistic generative models attractive, such as the kernel Fisher discriminant algorithm (Lawrence and Schölkopf 2001). However, for classification tasks, generative models generally have a higher asymptotic error than discriminative models (Ng and Jordan 2001). The following section shows how we estimate \(p(y_i|\hat{y}_i,x_i,h)\).

This framework for modeling class noise in a discriminative model motivates removing instances with low \(p(y_i|\hat{y}_i, h)\) and using ensembles to lessen the dependence on a given hypothesis h. Following Eq. 2, removing instances with low \(p(y_i|x_i,h)\) will increase the global p(h|T) since such instances contribute small factors \(p(\hat{y}_i|x_i,y_i,h)\) to the product. Further, following Eq. 3, an ensemble should theoretically be more robust to the bias of a particular hypothesis and to noise by utilizing multiple overfitting-avoidance techniques. This motivates our examination of a majority voting ensemble composed of models induced by different learning algorithms as a filtering technique and as a classifier.

4 Methodology

In this section, we present how we calculate \(p(y_i|\hat{y}_i, x_i, h)\) as well as the learning algorithms and data sets that we use in our analysis. We also provide an overview of our experiments.

4.1 Calculating \(p(y_i|\hat{y}_i,x_i,h)\)

To calculate \(p(y_i|\hat{y}_i,x_i,h)\) for each instance, we use a model induced by training a learning algorithm on the training set T. To lessen the dependence of \(p(y_i|\hat{y}_i,x_i,h)\) on a particular h, we approximate marginalizing over the hypothesis space \(\mathcal {H}\) by selecting a diverse set of learning algorithms to represent \(\mathcal {H}\). Diversity here means that the learning algorithms do not produce the same classifications for all instances, and it is determined using unsupervised meta-learning (UML) (Lee and Giraud-Carrier 2011). UML first uses Classifier Output Difference (COD) (Peterson and Martinez 2005) to measure the diversity between learning algorithms. COD measures the distance between two learning algorithms as the probability that they make different predictions. UML then clusters the learning algorithms based on their COD scores with hierarchical agglomerative clustering. We considered 20 learning algorithms from Weka with their default parameters (Hall et al. 2009). The resulting dendrogram is shown in Fig. 3, where the height of the line connecting two clusters corresponds to the distance (COD value) between them. A cut-point of 0.18 was chosen to create 9 clusters, and a representative algorithm from each cluster was used to form a diverse set of learning algorithms. The learning algorithms that were used are listed in Table 1. UML thus provides a diverse set of learning algorithms intended to be representative of \(\mathcal {H}\).
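To make the procedure concrete, the sketch below estimates pairwise COD values from cross-validated predictions and clusters the algorithms hierarchically. It is a minimal illustration under stated assumptions: the scikit-learn classifiers, the sample data set, and the clustering utilities are stand-ins for the 20 Weka algorithms and the data sets actually used.

```python
# Minimal sketch of unsupervised meta-learning (UML): estimate the Classifier
# Output Difference (COD) between pairs of learning algorithms as the fraction
# of instances on which their predictions disagree, then cluster the algorithms
# hierarchically on the COD distances. The classifiers and data set below are
# illustrative stand-ins, not the Weka algorithms used in the paper.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
learners = {
    "tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "5nn": KNeighborsClassifier(n_neighbors=5),
    "logistic": LogisticRegression(max_iter=1000),
}

# Cross-validated predictions so that COD reflects behaviour on unseen instances.
preds = {name: cross_val_predict(clf, X, y, cv=10) for name, clf in learners.items()}

names = list(learners)
cod = np.zeros((len(names), len(names)))
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        # COD: probability that the two algorithms make different predictions.
        cod[i, j] = cod[j, i] = np.mean(preds[names[i]] != preds[names[j]])

# Agglomerative clustering on the COD distances; cutting the dendrogram at a
# chosen distance (0.18 in the paper) yields clusters from which one
# representative algorithm each can be selected.
Z = linkage(squareform(cod), method="average")
print(dict(zip(names, fcluster(Z, t=0.18, criterion="distance"))))
```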

Fig. 3 Dendrogram of the considered learning algorithms clustered using unsupervised meta-learning based on their classifier output difference

Table 1 Set of learning algorithms used for filtering

4.2 Experiments

Given a method for estimating \(p(y_i|\hat{y}_i, x_i, h)\) and for lessening the dependence on a specific h, we examine several techniques for filtering instances with low \(p(y_i|\hat{y}_i, x_i, h)\) and for constructing a voting ensemble.

4.2.1 Misclassification filters

In this paper, we examine misclassification filters, which filter any instance that is misclassified by a given learning algorithm. Given that a number of different learning algorithms could be employed for filtering, we conduct an extensive evaluation of filtering misclassified instances using the diverse set of learning algorithms described in the previous section. Each learning algorithm first filters the instances in the training set that it misclassifies and then induces a model of the data from the filtered training set. Misclassification filters using a single learning algorithm establish a good baseline to compare against.
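As an illustration, the following sketch implements a single-algorithm misclassification filter under simple assumptions: the chosen scikit-learn classifier and data set are placeholders for the algorithms in Table 1 and the data sets in Table 2.

```python
# Minimal sketch of a biased misclassification filter: the same learning
# algorithm filters the training set and is then re-trained on the filtered data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: fit the filtering model on the full (possibly noisy) training set.
filter_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Step 2: keep only the instances that the filter classifies correctly.
keep = filter_model.predict(X_train) == y_train

# Step 3: induce the final model of the data from the filtered training set.
final_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train[keep], y_train[keep])
print("removed %d instances; test accuracy %.3f"
      % (np.sum(~keep), final_model.score(X_test, y_test)))
```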

4.2.2 Ensemble filter

We also examine using an ensemble filter, which removes instances that are misclassified by different percentages of the 9 learning algorithms. The ensemble filter more closely approximates \(p(y|\hat{y}_i,x_i)\) from Eq. 3 since it sums over a set of learning algorithms (which in this case were chosen to be diverse and to represent a larger subset of the hypothesis space \(\mathcal {H}\)), thereby lessening the dependence on a single hypothesis h. For the ensemble filter, \(p(y|\hat{y}_i,x_i)\) is estimated using a subset of learning algorithms \(\mathcal {L}\):

$$\begin{aligned} p(y|\hat{y}_i,x_i) \approx p(\hat{y}_i|x_i,\mathcal {L})&\approx \frac{1}{|\mathcal {L}|} \sum _{j=1}^{|\mathcal {L}|} p(\hat{y}_i|x_i, l_j(T)) \end{aligned}$$
(4)

where \(l_j(T)\) is the hypothesis from the jth learning algorithm trained on training set T. From Eq. 3, p(h) is estimated as \(\frac{1}{|\mathcal {L}|}\) for each of the hypotheses generated by training the learning algorithms in \(\mathcal {L}\) on T and as zero for all other hypotheses in \(\mathcal {H}\). Also, \(p(\hat{y}_i|x_i, l_j(T))\) is estimated using the indicator function \(\mathbbm {1}(h(x_i)=\hat{y}_i)\) since not all learning algorithms produce a probability distribution over the output classes. Set up this way, the ensemble filter removes any instance that is misclassified by at least x % of the learning algorithms in the ensemble. In this paper, we examine removing instances that are misclassified by 50, 70, and 90 % of the learning algorithms in the ensemble. One of the problems of using an ensemble filter is having to choose the percentage of learning algorithms that must misclassify an instance for it to be filtered. For the results, we report the accuracy from the percentage that produces the highest accuracy for each data set using 5 by 10-fold cross-validation. This method highlights the impact of using an ensemble filter; in practice, however, a validation set would typically be used to determine the percentage.
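A minimal sketch of this ensemble filter is given below, under the assumption of scikit-learn classifiers standing in for the 9 algorithms of Table 1 and a single illustrative threshold.

```python
# Minimal sketch of the ensemble filter (Eq. 4): an instance is removed when it
# is misclassified by at least a chosen fraction of a set of diverse learning
# algorithms, each trained on the full training set. The learners and the 0.7
# threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def ensemble_filter(X, y, learners, threshold=0.7):
    """Return a boolean mask of the instances to keep."""
    # Average of the indicator 1(l_j(x_i) != y_i) over the ensemble.
    misclassified = np.mean(
        [clf.fit(X, y).predict(X) != y for clf in learners], axis=0)
    # Keep instances misclassified by fewer than `threshold` of the learners.
    return misclassified < threshold

X, y = load_breast_cancer(return_X_y=True)
learners = [DecisionTreeClassifier(random_state=0), GaussianNB(),
            KNeighborsClassifier(n_neighbors=5), LogisticRegression(max_iter=1000)]
keep = ensemble_filter(X, y, learners, threshold=0.7)
print("kept %d of %d instances" % (keep.sum(), len(y)))
```

A final model would then be induced from the kept instances, with the threshold chosen per data set as described above.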

4.2.3 Adaptive filter

We also examine an adaptive filtering approach, shown in Algorithm 1, that iteratively builds a set of filtering learning algorithms. At each iteration, it adds the learning algorithm from a set of candidate learning algorithms \(\mathcal {L}\) that produces the highest classification accuracy on a validation set when added to the current filter set. The function \(\textit{runLA}(F)\) trains a learning algorithm on a data set using the filter set F to filter the instances and returns the accuracy of the learning algorithm on a validation set. As with the ensemble filter, instances are removed that are misclassified by a given percentage of the filtering learning algorithms. The idea is to approximate an optimal subset of learning algorithms through a greedy search over the candidate filtering algorithms. For the results, we report the accuracy from the percentage that produces the highest accuracy for each data set using n-fold cross-validation.

Algorithm 1 The adaptive filtering algorithm
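The sketch below gives one possible reading of this greedy procedure; it is not a reproduction of the exact \(\textit{runLA}\) used in the experiments. The candidate learners, the final model, the 0.5 filtering threshold, and the single validation split are illustrative assumptions.

```python
# Minimal sketch of the greedy adaptive filter (Algorithm 1): repeatedly add the
# candidate learning algorithm whose inclusion in the filter set yields the
# highest validation accuracy for the final model, stopping when no candidate
# improves it.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def run_la(filter_set, final_model, X_tr, y_tr, X_val, y_val, threshold=0.5):
    """Filter the training set with `filter_set`, train `final_model`, and
    return its validation accuracy (a simplified stand-in for runLA)."""
    if filter_set:
        misclassified = np.mean([clone(f).fit(X_tr, y_tr).predict(X_tr) != y_tr
                                 for f in filter_set], axis=0)
        keep = misclassified < threshold
    else:
        keep = np.ones(len(y_tr), dtype=bool)
    return clone(final_model).fit(X_tr[keep], y_tr[keep]).score(X_val, y_val)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

candidates = [DecisionTreeClassifier(random_state=0), GaussianNB(),
              KNeighborsClassifier(n_neighbors=5), LogisticRegression(max_iter=1000)]
final_model = DecisionTreeClassifier(max_depth=4, random_state=0)

filter_set = []
best = run_la(filter_set, final_model, X_tr, y_tr, X_val, y_val)
improved = True
while improved and candidates:
    improved = False
    scores = [run_la(filter_set + [c], final_model, X_tr, y_tr, X_val, y_val)
              for c in candidates]
    if max(scores) > best:
        best = max(scores)
        filter_set.append(candidates.pop(int(np.argmax(scores))))
        improved = True
print("selected %d filtering algorithms; validation accuracy %.3f"
      % (len(filter_set), best))
```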

4.2.4 Majority voting ensemble

In addition, we examine the use of a majority voting ensemble as an alternative to filtering misclassified instances. The majority voting ensemble is composed of the diverse set of learning algorithms described in Sect. 4.1. The classification of an instance is the class that receives the most votes from the trained models in the ensemble.
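A minimal sketch of such a heterogeneous majority vote follows; the four scikit-learn classifiers are stand-ins for the 9 algorithms of Table 1, and ties are broken arbitrarily toward the smallest class label.

```python
# Minimal sketch of a majority voting ensemble: each learning algorithm is
# trained on the (unfiltered) training set and a novel instance receives the
# class that gets the most votes. The learners are illustrative stand-ins.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

learners = [DecisionTreeClassifier(random_state=0), GaussianNB(),
            KNeighborsClassifier(n_neighbors=5), LogisticRegression(max_iter=1000)]

# Rows: learners, columns: test instances.
votes = np.array([clf.fit(X_tr, y_tr).predict(X_te) for clf in learners])

# Majority vote per instance (ties resolved toward the smallest class label).
y_pred = np.array([np.bincount(col).argmax() for col in votes.T])
print("voting ensemble test accuracy: %.3f" % np.mean(y_pred == y_te))
```

A similar result can be obtained with scikit-learn's VotingClassifier using hard voting.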

4.3 Evaluation

Each method is evaluated using 5 by 10-fold cross-validation (running 10-fold cross-validation 5 times, each time with a different seed to partition the data). We examine filtering using the 9 chosen learning algorithms on a set of 47 data sets from the UCI data repository and 7 non-UCI data sets (Thomson and McQueen 1996; Salojärvi et al. 2005; Sayyad Shirabad and Menzies 2005; Stiglic and Kokol 2009). For filtering, we examine two methods for training the filtering algorithms: (1) training on the entire training set and removing the instances that it misclassifies, and (2) using cross-validation on the training set and removing the instances that are misclassified in the held-out folds. The number of folds for cross-validation on the training set was set to 2, 3, 4, and 5. Table 2 shows the data sets used in this study organized according to the number of instances, number of attributes, and attribute type. The non-UCI data sets are in bold.

Table 2 Datasets used organized by number of instances (# Ins), number of attributes, and attribute type
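The difference between the two filter-training schemes can be illustrated as below; flagging misclassified instances via cross-validation within the training set typically removes more instances than flagging them with a model fit on the entire training set. The classifier and the 5-fold setting are illustrative assumptions.

```python
# Minimal sketch of the two filter-training schemes: (1) flag instances that a
# model trained on the entire training set misclassifies, and (2) flag instances
# misclassified when held out during k-fold cross-validation on the training
# set. The classifier and k = 5 are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0)

# (1) Filter trained on the entire training set.
keep_full = clf.fit(X, y).predict(X) == y

# (2) Filter based on predictions for held-out folds (5-fold CV here).
keep_cv = cross_val_predict(clf, X, y, cv=5) == y

print("removed, full training set: %d" % np.sum(~keep_full))
print("removed, 5-fold CV:         %d" % np.sum(~keep_cv))
```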

Statistical significance between pairs of algorithms is determined using the Wilcoxon signed-ranks test, as suggested by Demšar (2006); a minimal example of this test is given after the list below. We emphasize the extensive nature of this evaluation:

1. Filtering is examined on 9 diverse learning algorithms.
2. 9 diverse learning algorithms are examined as misclassification filtering techniques.
3. In addition to the single-algorithm misclassification filters, an ensemble filter and an adaptive filter are examined.
4. Each filtering method is examined on a set of 54 data sets using 5 by 10-fold cross-validation.
5. Each filtering method is examined on the entire training set as well as using 2-, 3-, 4-, and 5-fold cross-validation.
6. A majority voting ensemble is examined on a set of 54 data sets using 5 by 10-fold cross-validation.
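As referenced above, the sketch below shows how such a pairwise comparison can be computed with the Wilcoxon signed-ranks test over paired per-data-set accuracies; the accuracy vectors are placeholders included only to illustrate the call, and the use of scipy is an assumption rather than the tooling used for the paper's experiments.

```python
# Minimal sketch of the Wilcoxon signed-ranks test (Demšar 2006) applied to the
# paired per-data-set accuracies of two methods. The accuracy values below are
# placeholders purely to illustrate the call, not results from the paper.
from scipy.stats import wilcoxon

acc_method_a = [0.83, 0.91, 0.76, 0.88, 0.95, 0.71, 0.84, 0.90]
acc_method_b = [0.81, 0.90, 0.74, 0.89, 0.94, 0.70, 0.82, 0.88]

statistic, p_value = wilcoxon(acc_method_a, acc_method_b)
print("Wilcoxon statistic: %.1f, p value: %.3f" % (statistic, p_value))
```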

5 Results

In this section, we present the results from filtering the 54 data sets using (1) a biased misclassification filter (where the same learning algorithm used to filter misclassified instances is also used to induce a model of the data), (2) the ensemble filter, and (3) the adaptive filter, as well as the results of a voting ensemble. Our results can be summarized as follows: (1) using a voting ensemble is generally preferable to filtering, (2) when filtering, the ensemble filter produces the best results in all cases, and (3) filtering is preferable to a voting ensemble in some cases with high amounts of noise. Except for the adaptive filter, we find that using cross-validation on the training set for filtering results in lower (and often significantly lower) accuracy than using the entire training set; as such, the following results for the biased filter and the ensemble filter are from using the entire training set for filtering rather than cross-validation. We first show how filtering affects each learning algorithm in Sect. 5.1. Next, we examine using a set of data set measures to determine when filtering is the most effective in Sect. 5.2. In Sect. 5.3, we compare filtering with a voting ensemble.

5.1 Filtering results

The results of the biased, ensemble, and adaptive filters are summarized in Table 3, which shows the average classification accuracy for each learning algorithm and filtering algorithm pair. The values in bold represent those that are a statistically significant improvement over not filtering. The results of the statistical significance tests for each of the learning algorithms are provided in Tables 10, 11, 12, 13, 14, 15, 16, 17 and 18 in “Appendix 1”.

We find that using a biased filter does not significantly increase the classification accuracy for any of the learning algorithms and that it significantly decreases the classification accuracy for the LWL, naïve Bayes, Ridor, and RIPPER learning algorithms (Tables 13, 14, 17, 18). These results suggest that simply removing the instances misclassified by a single learning algorithm is not sufficient. Bear in mind that these results reflect the fact that no artificial noise was added to the training set. In the case where artificial noise is added to the training set (as was commonly done in previous work), using a biased filter may result in an improvement in accuracy. However, most real-world scenarios do not artificially add noise to a data set but are concerned with the inherent noise found within it.

Table 3 Summary of filtering using the same learning algorithm to filter misclassified instances and to induce a model of the data, the ensemble filter, and the adaptive filter

For all of the learning algorithms, the ensemble filter significantly increases the classification accuracy over not filtering and over the other filtering techniques. An ensemble generally provides better predictive performance than any of its constituent learning algorithms (Polikar 2006) and generally yields better results when the underlying ensembled models are diverse (Kuncheva and Whitaker 2003). Thus, by using a more powerful model, only the noisiest instances are removed. This provides empirical evidence that filtering instances with low \(p(\hat{y}_i|x_i)\), which does not depend on a single hypothesis, is preferable to filtering instances based on \(p(\hat{y}_i|x_i,h)\), where the probability of the class depends on a particular hypothesis, as outlined in Eq. 3.

Surprisingly, the adaptive filter does not outperform the ensemble filter and, in one case, does not even outperform training on unfiltered data. Perhaps this is because it overfits the training data, since the best accuracy is chosen on the training set. Adaptive filtering has significantly better results when cross-validation is used to identify misclassified instances as opposed to removing misclassified instances that were also used to train the filtering algorithm. Even when using cross-validation, however, the results are not significantly better than using the ensemble filter.

Examining each learning algorithm individually, we find that some learning algorithms are more robust to noise than others. To determine which learning algorithms are more robust to noise, we compare the accuracy of the learning algorithms without filtering to the accuracy obtained using the ensemble filter. The p values from the Wilcoxon signed-ranks statistical significance test are shown in Table 4, ordered from greatest (least significant impact) to least reading from left to right. We see that random forests and decision trees are the most robust to noise, as filtering has the least significant impact on their accuracy. This is not too surprising given that the C4.5 algorithm was designed to take noise into account and random forests are built from decision trees. Ridor and 5-nearest neighbor (IB5) are also relatively robust to noise, but still improve considerably with filtering. IB5 is more robust to noise because it considers the 5 nearest neighbors of an instance; if k were set to 1, filtering would have a greater effect on the accuracy. Filtering has the most significant effect on the accuracy of the last five learning algorithms: MLP, NNge, LWL, RIPPER, and naïve Bayes.

Table 4 The p values from the Wilcoxon signed-ranks statistical significance test comparing not filtering with the ensemble filter

5.2 Analysis of when to filter

Using only the inherent noise in a data set, the efficacy of filtering is limited and filtering can be detrimental on some data sets. Thus, we examine the cases in which filtering significantly improves the classification accuracy. This investigation is similar to the recent work by Sáez et al. (2013), who investigate creating a set of rules to understand when to filter using a 1-nearest neighbor learning algorithm. They use a set of data complexity measures from Ho and Basu (2002). These complexity measures are designed for binary classification problems, but we do not limit ourselves to binary classification problems. As such, we use the subset of data complexity measures shown in Table 5 that have been extended to handle multi-class problems (Orriols-Puig et al. 2009). In addition, we also examine a set of hardness measures (Smith et al. 2014), shown in Table 6. The hardness measures are designed to identify and characterize instances that have a high likelihood of being misclassified and are computed with respect to a specific instance. For the hardness measures, the “disjunct” refers to the leaf of a decision tree that classifies the investigated instance. We examine using the set of data complexity measures and the hardness measures to create rules and/or a classifier to determine when to use filtering. We set up the classification problem similarly to Sáez et al., where the features are the complexity measures and the hardness measures. The class label is set to “TRUE” if filtering significantly improves the classification accuracy for a data set according to the Wilcoxon signed-ranks test and to “FALSE” otherwise. We also examine predicting the difference in accuracy between using and not using a filter. Unlike Sáez et al., we find that the data complexity measures and the hardness measures do not produce a satisfactory classifier to determine when to filter. Granted, we examine more learning algorithms and do not artificially add noise to the data sets, which results in few data sets where filtering significantly improves the classification accuracy. In the study by Sáez et al., 75 % of the data sets had at least 5 % noise added, providing more positive examples. Further work is required to determine when to use filtering on unmodified data sets. Based on our results, we would recommend always using the ensemble filter for all of the learning algorithms as it significantly outperforms the other filtering techniques.

Table 5 List of complexity measures from Ho and Basu (2002)
Table 6 List of hardness measures from Smith et al. (2014)

5.3 Voting ensemble versus filtering

In this section, we compare the results of filtering using the ensemble filter with a voting ensemble. The voting ensemble uses the same learning algorithms as the ensemble filter (Table 1) and the vote from each learning algorithm is equally weighted. Table 7 compares the voting ensemble with using the ensemble filter on each of the investigated learning algorithms, giving the average accuracy, the p value, and the number of times that the accuracy of the voting ensemble is greater than, equal to, or less than that obtained using the ensemble filter. The results for each data set are provided in Table 19 in “Appendix 2”. With no artificially generated noise, a voting ensemble achieves significantly higher classification accuracy than the ensemble filter for each of the examined learning algorithms. This is not too surprising considering that previous research has shown that ensemble methods address issues that are common to all non-ensemble learning algorithms (Dietterich 2000) and that ensemble methods generally obtain a greater accuracy than any single learning algorithm that makes up part of the ensemble (Opitz and Maclin 1999). Considering the computational requirements for training, using a voting ensemble for classification rather than filtering appears to be more beneficial.

Table 7 Summary of the comparison between a voting ensemble and filtering using the ensemble filter on each learning algorithm

Many previous studies (Zhu and Wu 2004; Lawrence and Schölkopf 2001; Brodley and Friedl 1999; Verbaeten and Van Assche 2003) have shown that when a large amount of artificial noise is added to a data set (i.e. \({\ge }10\,\%\)), filtering outperforms a voting ensemble. We examine which of the 54 data sets have a high percentage of noise using instance hardness (Smith et al. 2014) to identify suspected noisy instances. Instance hardness approximates the likelihood that an instance will be misclassified by evaluating the classification of an instance by a set of learning algorithms \(\mathcal {L}\): \(p(\hat{y}_i|x_i,\mathcal {L})\). The set of learning algorithms \(\mathcal {L}\) is composed of the learning algorithms shown in Table 1. We consider instances with a probability greater than 0.9 of being misclassified to be noisy. Table 8 shows the accuracies from a voting ensemble and from the considered learning algorithms using the ensemble filter for the subset of data sets with more than 10 % noisy instances. Examining these noisier data sets shows that the gains from using the ensemble filter are more noticeable. However, only 9 of the 54 investigated data sets were identified as having more than 10 % noisy instances. We ran a Wilcoxon signed-ranks test, but with the small sample size it is difficult to determine the statistical significance of using the ensemble filter over using a voting ensemble. Based on the small sample provided here, training a learning algorithm on a filtered data set is statistically equivalent to training a voting ensemble classifier. The computational complexity required to train an ensemble is less than that required to train an ensemble for filtering and then train another learning algorithm on the filtered data set. A single learning algorithm trained on the filtered data set also has the benefit that only one learning algorithm is queried for a novel instance. Future work will include discovering whether a smaller subset of learning algorithms for filtering approximates the ensemble filter in order to reduce the computational complexity.
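A minimal sketch of flagging suspected noisy instances with instance hardness follows; the classifiers are stand-ins for the set \(\mathcal {L}\) of Table 1, and estimating the misclassification likelihood from cross-validated predictions is an assumption of the sketch rather than the exact procedure of Smith et al. (2014).

```python
# Minimal sketch of identifying suspected noisy instances with instance
# hardness: the likelihood that an instance is misclassified is estimated as
# the fraction of learning algorithms in L whose (held-out) prediction for it
# is wrong, and instances above a 0.9 threshold are treated as noisy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
learners = [DecisionTreeClassifier(random_state=0), GaussianNB(),
            KNeighborsClassifier(n_neighbors=5), LogisticRegression(max_iter=1000)]

# Instance hardness: fraction of learners whose held-out prediction is wrong.
hardness = np.mean(
    [cross_val_predict(clf, X, y, cv=10) != y for clf in learners], axis=0)
noisy = hardness > 0.9
print("suspected noisy instances: %d of %d (%.1f%%)"
      % (noisy.sum(), len(y), 100 * noisy.mean()))
```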

Table 8 Comparison of a voting ensemble with the ensemble filter on a subset of data sets where more than 10 % of the constituent instances are noisy

Examining the noisier data sets shows that filtering has a more significant effect on classification accuracy; however, the amount of noise is not the only factor that needs to be considered. For example, 32.2 % of the instances in the primary-tumor data set are noisy, yet only one learning algorithm achieves a greater classification accuracy than the voting ensemble. On the other hand, the classification accuracy on the ar1 and ozone data sets for all of the considered learning algorithms trained on filtered data is greater than that of a voting ensemble, despite these data sets having only 3.3 and 0.5 % noisy instances, respectively. Thus, there are other, as yet unknown, data set features affecting when filtering is appropriate. Future work also includes discovering and examining data set features that are indicative of when filtering should be used.

Table 9 Comparison of a majority voting ensemble trained on unfiltered (Ens) and filtered data (FEns)

We further investigate the robustness of the majority voting ensemble to noise by applying the ensemble filter to the training data for the voting ensemble. We find that a majority voting ensemble is significantly better without filtering. The summary results are shown in Table 9 and the full results for each data set can be found in Table 20 in “Appendix 2”. Table 9 divides the data sets into subsets that have more than 10 % noisy instances (“Noisy”) and those that have an original accuracy of less than 90, 80, 70, 60, and 50 % averaged across the investigated learning algorithms (\({<}N\,\%\)). Even on harder data sets and data sets with more noisy instances, using unfiltered training data produces significantly higher classification accuracy for the voting ensemble. Thus, we find that a majority voting ensemble is more robust to noise than filtering in most cases. The strength of a voting ensemble comes from the diversity of the ensembled learning algorithms. However, the models induced by the learning algorithms trained on the filtered training data are less diverse, since the diversity often comes from how a learning algorithm treats a noisy instance, which lessens the power of the voting ensemble. This is evidenced by our examination of a voting ensemble consisting of C4.5, random forest, and Ridor, which are three of the more similar learning algorithms according to unsupervised meta-learning (see Sect. 4). When trained on the filtered training data, this less diverse voting ensemble achieves a significantly lower average classification accuracy of 82.09 % compared to 83.62 % for the voting ensemble composed of the 9 examined learning algorithms. Thus, some noise in the training set is beneficial for creating diversity in the ensemble.

6 Conclusions and discussion

In this paper, we presented an extensive empirical evaluation of misclassification filters on a set of 54 multi-class data sets and 9 diverse learning algorithms. As opposed to other work on filtering, we used a large set of data sets and learning algorithms and did not artificially add noise to the data sets. In previous work, noise was added to a data set to verify that the noise filtering method was effective and that filtering was more effective when more noise was present. However, the artificial noise may not be representative of the actual noise, and the impact of filtering on an unmodified data set is not always clear.

Using a set of multi-class data sets, we focused our analysis on accuracy. However, for many 2-class problems other metrics, such as precision or recall, could be more indicative of good performance. We also did not examine class imbalance, which may affect the estimated probability of a class, or cases where one class is more important than another, such as avoiding a false negative when diagnosing a terminal disease. These are important issues that arise in many real-world machine learning applications. In cases of extreme class imbalance, all of the instances of a minority class could be removed because they have low \(p(y_i|\hat{y}_i, x_i, h)\). Thus, other techniques to account for class imbalance should also be used. This risk also highlights a benefit of using a voting ensemble, as no instances are discarded. However, the voting ensemble requires a higher computational budget to induce the ensembled models. For dealing with class imbalance or with classes weighted differently by importance, we suggest using a voting ensemble together with another technique that addresses the class imbalance or the class weights.

Through our experiments we found that, without artificially adding label noise, using the same learning algorithm for filtering and for inducing a model of the data can be significantly detrimental and does not significantly increase the classification accuracy even when examining harder data sets. Using the ensemble filter significantly improved the accuracy over not filtering and outperformed both the adaptive filtering method and using each learning algorithm individually as a filter for all of the investigated learning algorithms. We compared filtering with a voting ensemble and found that a voting ensemble achieves significantly higher classification accuracy than any of the other considered learning algorithms trained on filtered data. A majority voting ensemble trained on unfiltered data significantly outperforms a voting ensemble trained on filtered data. Thus, a voting ensemble exhibits robustness to noise in the training set and is preferable to filtering.