1.1 Introduction

Over the last couple of decades, multiple classifier systems, also called ensemble systems, have enjoyed growing attention within the computational intelligence and machine learning community. This attention has been well deserved, as ensemble systems have proven themselves to be very effective and extremely versatile in a broad spectrum of problem domains and real-world applications. Originally developed to reduce the variance—thereby improving the accuracy—of an automated decision-making system, ensemble systems have since been successfully used to address a variety of machine learning problems, such as feature selection, confidence estimation, missing features, incremental learning, error correction, class-imbalanced data, and learning concept drift from nonstationary distributions, among others. This chapter provides an overview of ensemble systems, their properties, and how they can be applied to such a wide spectrum of applications.

Truth be told, machine learning and computational intelligence researchers have been rather late in discovering ensemble-based systems and the benefits such systems offer for decision making. While there is now a significant body of knowledge and literature on ensemble systems as a result of a couple of decades of intensive research, ensemble-based decision making has in fact been around, and part of our daily lives, for perhaps as long as civilized communities have existed. Ensemble-based decision making is nothing new to us; as humans, we use such systems in our daily lives so often that it is perhaps second nature to us. Examples are many: the essence of democracy, where a group of people vote to make a decision, whether to choose an elected official or to decide on a new law, is in fact based on ensemble-based decision making. The judicial system in many countries, whether based on a jury of peers or a panel of judges, is also based on ensemble-based decision making. Perhaps more practically, whenever we are faced with making a decision that has some important consequence, we often seek the opinions of different “experts” to help us make that decision: consulting with several doctors before agreeing to a major medical operation, reading user reviews before purchasing an item, calling references before hiring a potential job applicant, even peer review of this article prior to publication, are all examples of ensemble-based decision making. In the context of this discussion, we will loosely use the terms expert, classifier, hypothesis, and decision interchangeably.

While the original goal for using ensemble systems is in fact similar to the reason we use such mechanisms in our daily lives—that is, to improve our confidence that we are making the right decision by weighing various opinions and combining them through some thought process to reach a final decision—there are many other machine-learning-specific applications of ensemble systems. These include confidence estimation, feature selection, addressing missing features, incremental learning from sequential data, data fusion of heterogeneous data types, learning in nonstationary environments, and addressing class-imbalanced data problems, among others.

In this chapter, we first provide a background on ensemble systems, including statistical and computational reasons for using them. Next, we discuss the three pillars of ensemble systems: diversity, training ensemble members, and combining ensemble members. After an overview of commonly used ensemble-based algorithms, we then look at the various aforementioned applications of ensemble systems as we try to answer the question “what else can ensemble systems do for you?”

1.1.1 Statistical and Computational Justifications for Ensemble Systems

The premise of using ensemble-based decision systems in our daily lives is fundamentally no different from their use in computational intelligence. We consult with others before making a decision often because of the variability in the past record and accuracy of any single decision maker. If in fact there were such an expert, or perhaps an oracle, whose predictions were always true, we would never need any other decision maker, and there would never be a need for ensemble-based systems. Alas, no such oracle exists; every decision maker has an imperfect past record. In other words, the accuracy of each decision maker’s decision has a nonzero variability. Now, note that any classification error is composed of two components that we can control: bias, the accuracy of the classifier; and variance, the precision of the classifier when trained on different training sets. Often, these two components have a trade-off relationship: classifiers with low bias tend to have high variance and vice versa. On the other hand, we also know that averaging has a smoothing (variance-reducing) effect. Hence, the goal of ensemble systems is to create several classifiers with relatively fixed (or similar) bias and then combine their outputs, say by averaging, to reduce the variance.

The reduction of variability can be thought of as reducing high-frequency (high-variance) noise using a moving average filter, where each sample of the signal is averaged with the neighboring samples around it. Assuming that the noise in each sample is independent, the noise component is averaged out, whereas the information content that is common to all segments of the signal is unaffected by the averaging operation. Increasing classifier accuracy using an ensemble of classifiers works exactly the same way: assuming that classifiers make different errors on each sample, but generally agree on their correct classifications, averaging the classifier outputs reduces the error by averaging out the error components.

It is important to point out two issues here: first, in the context of ensemble systems, there are many ways of combining ensemble members, of which averaging the classifier outputs is only one method. We discuss different combination schemes later in this chapter. Second, combining the classifier outputs does not necessarily lead to a classification performance that is guaranteed to be better than the best classifier in the ensemble. Rather, it reduces our likelihood of choosing a classifier with a poor performance. After all, if we knew a priori which classifier would perform the best, we would only use that classifier and would not need to use an ensemble. A representative illustration of the variance reduction ability of the ensemble of classifiers is shown in Fig. 1.1.
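To make the variance-reduction argument concrete, the following minimal simulation (our own illustration; the noise level of 0.2 and the ensemble size T = 25 are arbitrary assumptions) compares the variance of a single noisy decision score to that of an average of T independent scores:

```python
# Minimal sketch: averaging T independent noisy estimates of the same
# decision score reduces variance by roughly a factor of T.
import numpy as np

rng = np.random.default_rng(0)
true_score = 0.7            # the quantity each "classifier" tries to estimate
T = 25                      # ensemble size (assumed)
n_trials = 10_000

single = true_score + rng.normal(0.0, 0.2, size=n_trials)
ensemble = true_score + rng.normal(0.0, 0.2, size=(n_trials, T)).mean(axis=1)

print(f"single classifier variance: {single.var():.5f}")
print(f"ensemble average variance : {ensemble.var():.5f}  (about 1/{T} of the above)")
```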

Fig. 1.1 Variability reduction using ensemble systems

1.1.2 Development of Ensemble Systems

Many reviews refer to Dasarathy and Sheela’s 1979 work as one of the earliest examples of ensemble systems [1], with their ideas on partitioning the feature space using multiple classifiers. About a decade later, Hansen and Salamon showed that an ensemble of similarly configured neural networks can be used to improve classification performance [2]. However, it was Schapire’s work that demonstrated, through a procedure he named boosting, that a strong classifier with arbitrarily low error on a binary classification problem can be constructed from an ensemble of classifiers, each of which merely needs to do better than random guessing [3]. The theory of boosting provided the foundation for the subsequent suite of AdaBoost algorithms, arguably the most popular ensemble-based algorithms, extending the boosting concept to multiclass and regression problems [4]. We briefly describe the boosting algorithms below, but a more detailed coverage of these algorithms can be found in Chap. 2 of this book, and in Kuncheva’s text [5].

In part due to the success of these seminal works, and in part based on independent efforts, research in ensemble systems has since exploded, with different flavors of ensemble-based algorithms appearing under different names: bagging [6], random forests (an ensemble of decision trees), composite classifier systems [1], mixture of experts (MoE) [7, 8], stacked generalization [9], consensus aggregation [10], combination of multiple classifiers [11–15], dynamic classifier selection [15], classifier fusion [16–18], committee of neural networks [19], classifier ensembles [19, 20], among many others. These algorithms, and in general all ensemble-based systems, typically differ from each other based on the selection of training data for individual classifiers, the specific procedure used for generating ensemble members, and/or the combination rule for obtaining the ensemble decision. As we will see, these are the three pillars of any ensemble system.

In most cases, ensemble members are used in one of two general settings: classifier selection and classifier fusion [5, 15, 21]. In classifier selection, each classifier is trained as a local expert in some local neighborhood of the entire feature space. Given a new instance, the classifier trained with data closest to the vicinity of this instance, in some distance metric sense, is then chosen to make the final decision, or given the highest weight in contributing to the final decision [7, 15, 22, 23]. In classifier fusion, all classifiers are trained over the entire feature space and then combined to obtain a composite classifier with lower variance (and hence lower error). Bagging [6], random forests [24], arc-x4 [25], and boosting/AdaBoost [3, 4] are examples of this approach. Combining the individual classifiers can be based on the labels only, or based on class-specific continuous valued outputs [18, 26, 27], for which classifier outputs are first normalized to the [0, 1] interval so that they can be interpreted as the support given by the classifier to each class [18, 28]. Such an interpretation leads to algebraic combination rules (simple or weighted majority voting, maximum/minimum/sum/product, or other combinations of class-specific outputs) [12, 27, 29], the Dempster–Shafer-based classifier fusion [13, 30], or decision templates [18, 21, 26, 31]. Many of these combination rules are discussed below in more detail.

A sample of the immense literature on classifier combination can be found in Kuncheva’s book [5] (and references therein), an excellent text devoted to theory and implementation of ensemble-based classifiers.

1.2 Building an Ensemble System

Three strategies need to be chosen for building an effective ensemble system. We have previously referred to these as the three pillars of ensemble systems: (1) data sampling/selection; (2) training member classifiers; and (3) combining classifiers.

1.2.1 Data Sampling and Selection: Diversity

Making different errors on any given sample is of paramount importance in ensemble-based systems. After all, if all ensemble members provide the same output, there is nothing to be gained from their combination. Therefore, we need diversity in the decisions of ensemble members, particularly when they are making an error. The importance of diversity for ensemble systems is well established [32, 33]. Ideally, classifier outputs should be independent or, preferably, negatively correlated [34, 35].

Diversity in ensembles can be achieved through several strategies, although using different subsets of the training data is the most common approach, as also illustrated in Fig. 1.1. Different sampling strategies lead to different ensemble algorithms. For example, using bootstrapped replicas of the training data leads to bagging, whereas sampling from a distribution that favors previously misclassified samples is the core of boosting algorithms. On the other hand, one can also use different subsets of the available features to train each classifier, which leads to random subspace methods [36]. Other, less common, approaches include using different parameters of the base classifier (such as training an ensemble of multilayer perceptrons, each with a different number of hidden layer nodes), or even using different base classifiers as the ensemble members. Definitions of different types of diversity measures can be found in [5, 37, 38]. We should also note that while the importance of diversity is well established, and a lack of diversity is known to lead to inferior ensemble performance, an explicit relationship between diversity and ensemble accuracy has not been identified [38, 39].

1.2.2 Training Member Classifiers

At the core of any ensemble-based system is the strategy used to train individual ensemble members. Numerous competing algorithms have been developed for training ensemble classifiers; however, bagging (and related algorithms arc-x4 and random forests), boosting (and its many variations), stacked generalization, and hierarchical MoE remain the most commonly employed approaches. These approaches are discussed in more detail below, in Sect. 1.3.

1.2.3 Combining Ensemble Members

The last step in any ensemble-based system is the mechanism used to combine the individual classifiers. The strategy used in this step depends, in part, on the type of classifiers used as ensemble members. For example, some classifiers, such as support vector machines, provide only discrete-valued label outputs. The most commonly used combination rule for such classifiers is (simple or weighted) majority voting, followed at a distant second by the Borda count. Other classifiers, such as the multilayer perceptron or the (naïve) Bayes classifier, provide continuous valued class-specific outputs, which are interpreted as the support given by the classifier to each class. A wider array of options is available for such classifiers, such as arithmetic (sum, product, mean, etc.) combiners or more sophisticated decision templates, in addition to voting-based approaches. Many of these combiners can be used immediately after the training is complete, whereas more complex combination algorithms may require an additional training step (as used in stacked generalization or hierarchical MoE). We now briefly discuss some of these approaches.

1.2.3.1 Combining Class Labels

Let us first assume that only the class labels are available from the classifier outputs, and define the decision of the tth classifier as \(d_{t,c} \in \{0,1\}\), t = 1, …, T and c = 1, …, C, where T is the number of classifiers and C is the number of classes. If the tth classifier (or hypothesis) \(h_t\) chooses class \(\omega_c\), then \(d_{t,c} = 1\), and 0 otherwise. Note that continuous valued outputs can easily be converted to label outputs (by assigning \(d_{t,c} = 1\) for the class with the highest output), but not vice versa. Therefore, the combination rules described in this section can also be used by classifiers providing continuous class supports.

1.2.3.1.1 Majority Voting

Majority voting has three flavors, depending on whether the ensemble decision is the class (1) on which all classifiers agree (unanimous voting); (2) predicted by at least one more than half the number of classifiers (simple majority); or (3) that receives the highest number of votes, whether or not the sum of those votes exceeds 50% (plurality voting). When not specified otherwise, majority voting usually refers to plurality voting, which can be mathematically defined as follows: choose class \(\omega_{c^{*}}\) if

$$\sum_{t=1}^{T}{d}_{t,{c}^{*}} =\max_{c}\,\sum_{t=1}^{T}{d}_{t,c}$$
(1.1)

If the classifier outputs are independent, then it can be shown that majority voting is the optimal combination rule. To see this, consider an odd number T of classifiers, with each classifier having a probability p of correct classification. Then, the ensemble makes the correct decision if at least \(\lfloor T/2\rfloor + 1\) of these classifiers choose the correct label. Here, the floor function ⌊·⌋ returns the largest integer less than or equal to its argument. The accuracy of the ensemble is governed by the binomial distribution: the probability of having \(k \geq \lfloor T/2\rfloor + 1\) out of T classifiers returning the correct class. Since each classifier has a success rate of p, the probability of ensemble success is then

$${p}_{\mathrm{ens}} =\sum_{k=\lfloor T/2\rfloor +1}^{T}\binom{T}{k}{p}^{k}{(1 - p)}^{T-k}$$
(1.2)

Note that \(p_{\mathrm{ens}}\) approaches 1 as T → ∞ if p > 0.5, and approaches 0 if p < 0.5. This result is also known as the Condorcet jury theorem (1786), as it formalizes the probability of a plurality-based jury decision being the correct one. Equation (1.2) makes a powerful statement: if the probability of a member classifier giving the correct answer is higher than 1/2, which really is the least we can expect from a classifier on a binary class problem, then the probability of success approaches 1 very quickly. If we have a multiclass problem, the same concept holds as long as each classifier has a probability of success better than random guessing (i.e., p > 1/4 for a four-class problem). An extensive and excellent analysis of the majority voting approach can be found in [5].
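The behavior predicted by (1.2) is easy to verify numerically. The short sketch below computes \(p_{\mathrm{ens}}\) directly from the binomial sum; the ensemble sizes and p = 0.6 are arbitrary illustrative choices:

```python
# Numerical check of (1.2): probability that a majority of T independent
# classifiers, each correct with probability p, yields the correct class.
from math import comb

def ensemble_accuracy(T: int, p: float) -> float:
    """Probability that at least floor(T/2) + 1 of T classifiers are correct."""
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(T // 2 + 1, T + 1))

for T in (1, 5, 21, 101):
    print(f"T={T:3d}, p=0.6 -> p_ens = {ensemble_accuracy(T, 0.6):.4f}")
# p_ens climbs toward 1 as T grows, as the Condorcet jury theorem predicts.
```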

1.2.3.1.2 Weighted Majority Voting

If we have reason to believe that some of the classifiers are more likely to be correct than others, weighting the decisions of those classifiers more heavily can further improve the overall performance compared to that of plurality voting. Let us assume that we have a mechanism for predicting the (future) approximate generalization performance of each classifier. We can then assign a weight \(w_t\) to classifier \(h_t\) in proportion to its estimated generalization performance. The ensemble, combined according to weighted majority voting, then chooses class \(\omega_{c^{*}}\) if

$$\sum_{t=1}^{T}{w}_{t}{d}_{t,{c}^{*}} =\max_{c}\,\sum_{t=1}^{T}{w}_{t}{d}_{t,c}$$
(1.3)

that is, if the total weighted vote received by class \(\omega_{c^{*}}\) is higher than the total vote received by any other class. In general, voting weights are normalized such that they add up to 1.

So, how do we assign the weights? If we knew, a priori, which classifiers would work better, we would only use those classifiers. In the absence of such information, a plausible and commonly used strategy is to use the performance of a classifier on a separate validation (or even training) dataset, as an estimate of that classifier’s generalization performance. As we will see in the later sections, AdaBoost follows such an approach. A detailed discussion on weighted majority voting can also be found in [40].
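As an illustration of (1.3), the sketch below implements weighted majority voting over hard label outputs; the five votes and the weights, which stand in for estimated generalization performances, are hypothetical:

```python
# Sketch of weighted majority voting (1.3) over label outputs.
import numpy as np

def weighted_majority_vote(labels: np.ndarray, weights: np.ndarray) -> int:
    """labels: (T,) predicted class indices; weights: (T,) voting weights."""
    weights = weights / weights.sum()        # normalize weights to sum to 1
    support = np.bincount(labels, weights=weights, minlength=labels.max() + 1)
    return int(support.argmax())

labels = np.array([0, 1, 1, 0, 0])              # five classifiers' votes
weights = np.array([0.9, 0.6, 0.6, 0.5, 0.4])   # assumed performance estimates
print(weighted_majority_vote(labels, weights))  # -> 0
```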

1.2.3.1.3 Borda Count

Voting approaches typically use a winner-take-all strategy, i.e., only the class that is chosen by each classifier receives a vote, ignoring any support that nonwinning classes may receive. Borda count uses a different approach, feasible if we can rank order the classifier outputs, that is, if we know the class with the most support (the winning class), as well as the class with the second most support, etc. Of course, if the classifiers provide continuous outputs, the classes can easily be rank ordered with respect to the support they receive from the classifier.

In Borda count, devised in 1770 by Jean Charles de Borda, each classifier (decision maker) rank orders the classes. If there are C candidates, the winning class receives C-1 votes, the class with the second highest support receives C-2 votes, and the class with the ith highest support receives C-i votes. The class with the lowest support receives no votes. The votes are then added up, and the class with the most votes is chosen as the ensemble decision.
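The following sketch illustrates the Borda count for classifiers whose continuous supports can be rank ordered; the support matrix is hypothetical:

```python
# Sketch of the Borda count: each classifier ranks the C classes, and a class
# ranked in the ith highest position receives C - i votes.
import numpy as np

def borda_count(supports: np.ndarray) -> int:
    """supports: (T, C) continuous class supports, one row per classifier."""
    # argsort of argsort yields each class's rank (0 = lowest support), which
    # equals the number of classes it beats, i.e., its Borda votes.
    votes = supports.argsort(axis=1).argsort(axis=1)
    return int(votes.sum(axis=0).argmax())

supports = np.array([[0.50, 0.30, 0.20],
                     [0.30, 0.50, 0.20],
                     [0.40, 0.35, 0.25]])
print(borda_count(supports))  # -> 0 (class 0 collects the most rank votes)
```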

1.2.3.2 Combining Continuous Outputs

If a classifier provides continuous output for each class (such as multilayer perceptron or radial basis function networks, naïve Bayes, relevance vector machines, etc.), such outputs—upon proper normalization (such as the softmax normalization in (1.4) [41])—can be interpreted as the degree of support given to that class, and under certain conditions can also be interpreted as an estimate of the posterior probability for that class. Representing the actual classifier output corresponding to class \(\omega_c\) for instance x as \(g_c(x)\), and the normalized value as \(\tilde{g}_c(x)\), the approximate posterior probability \(P(\omega_c \vert x)\) can be obtained as

$$P({\omega }_{c}\vert \mathbf{x}) \approx \tilde{{g}}_{c}(\mathbf{x}) = \frac{{\mathrm{e}}^{{g}_{c}(\mathbf{x})}}{\sum_{i=1}^{C}{\mathrm{e}}^{{g}_{i}(\mathbf{x})}}\;\Rightarrow\;\sum_{i=1}^{C}\tilde{{g}}_{i}(\mathbf{x}) = 1$$
(1.4)
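A minimal implementation of the softmax normalization in (1.4) is shown below; the raw outputs are hypothetical, and the usual max-subtraction trick is included for numerical stability:

```python
# Softmax normalization (1.4): raw outputs g_c(x) become nonnegative supports
# that sum to 1 and can be read as posterior probability estimates.
import numpy as np

def softmax(g: np.ndarray) -> np.ndarray:
    e = np.exp(g - g.max())     # subtract max for numerical stability
    return e / e.sum()

g = np.array([2.0, 1.0, -0.5])  # hypothetical raw outputs for C = 3 classes
print(softmax(g), softmax(g).sum())  # normalized supports; sum = 1
```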

In order to consolidate different combination rules, we use Kuncheva’s decision profile matrix DP(x) [18], whose elements \(d_{t,c}(x) \in [0, 1]\) represent the support given by the tth classifier to class \(\omega_c\). Specifically, as illustrated in Fig. 1.2, the rows of DP(x) represent the support given by individual classifiers to each of the classes, whereas the columns represent the support received by a particular class \(\omega_c\) from all classifiers.

Fig. 1.2 Decision profile for a given instance x

1.2.3.2.1 Algebraic Combiners

In algebraic combiners, the total support for each class is obtained as a simple algebraic function of the supports received by the individual classifiers. Following the notation used in [18], let us represent the total support received by class \(\omega_c\), the cth column of the decision profile DP(x), as

$${\mu }_{c}(\mathbf{x}) = F\left [{d}_{1,c}(\mathbf{x}),\ldots ,{d}_{T,c}(\mathbf{x})\right ]$$
(1.5)

where F[·] is one of the following combination functions.

Mean Rule: The support for class ωc is the average of all classifiers’ cth outputs,

$$\begin{array}{rcl}{ \mu }_{c}\mathbf{(\mathit{x})} = \frac{1} {T}{\sum \nolimits }_{t=1}^{T}{d}_{ t,c}\mathbf{(\mathit{x})}& &\end{array}$$
(1.6)

hence the function F[·] is the averaging function. Note that the mean rule results in the same final classification as the sum rule, which differs from the mean rule only by the 1/T normalization factor. In either case, the final decision is the class \(\omega_c\) for which the total support \(\mu_c(x)\) is the highest.

Weighted Average: The weighted average rule combines the mean and the weighted majority voting rules, where the weights are applied not to class labels, but to the actual continuous outputs. The weights can be obtained during the ensemble generation as part of the regular training, as in AdaBoost, or a separate training can be used to obtain the weights, such as in a MoE. Usually, each classifier ht receives a weight, although it is also possible to assign a weight to each class output of each classifier. In the former case, we have T weights, w1, …, wT, usually obtained as estimated generalization performances based on training data, with the total support for class ωc as

$$\begin{array}{rcl}{ \mu }_{c}\mathbf{(\mathit{x})} = \frac{1} {T}{\sum \nolimits }_{t=1}^{T}{w}_{ t}{d}_{t,c}\mathbf{(\mathit{x})}& &\end{array}$$
(1.7)

In the latter case, there are T × C class- and classifier-specific weights, which leads to a class-conscious combination of classifier outputs [18]. The total support for class \(\omega_c\) is then

$$\begin{array}{rcl}{ \mu }_{c}\mathbf{(\mathit{x})} = \frac{1} {T}{\sum \nolimits }_{t=1}^{T}{w}_{ t,c}{d}_{t,c}\mathbf{(\mathit{x})}& &\end{array}$$
(1.8)

where \(w_{t,c}\) is the weight of the tth classifier for classifying class \(\omega_c\) instances.

Trimmed mean: Sometimes classifiers may erroneously give unusually low or high support to a particular class, such that the correct decisions of the other classifiers are not enough to undo the damage done by this unusual vote. This problem can be avoided by discarding the decisions of those classifiers with the highest and lowest supports before calculating the mean, which gives the trimmed mean. For an R% trimmed mean, R% of the supports from each end are removed, with the mean calculated on the remaining supports, avoiding the extreme values of support. Note that the 50% trimmed mean is equivalent to the median rule discussed below.

Minimum/Maximum/Median Rule: These functions simply take the minimum, maximum, or the median among the classifiers’ individual outputs.

$$\begin{array}{rcl}{ \mu }_{c}\mathbf{(\mathit{x})}& =& {\min }_{t=1,\ldots ,T}\{{d}_{t,c}\mathbf{(\mathit{x})}\} \\ {\mu }_{c}\mathbf{(\mathit{x})}& =& {\max }_{t=1,\ldots ,T}\{{d}_{t,c}\mathbf{(\mathit{x})}\} \\ {\mu }_{c}\mathbf{(\mathit{x})}& =&{ \mathrm{median}}_{t=1,\ldots ,T}\{{d}_{t,c}\mathbf{(\mathit{x})}\}\end{array}$$
(1.9)

where the ensemble decision is chosen as the class for which total support is largest. Note that the minimum rule chooses the class for which the minimum support among the classifiers is highest.

Product Rule: The product rule chooses the class whose product of supports from each classifier is the highest. Due to the nulling nature of multiplying with zero, this rule decimates any class that receives at least one zero (or very small) support.

$$\begin{array}{rcl}{ \mu }_{c}\mathbf{(\mathit{x})} = \frac{1} {T}{\prod \nolimits }_{t=1}^{T}{d}_{ t,c}\mathbf{(\mathit{x})}& &\end{array}$$
(1.10)

Generalized Mean: All of the aforementioned rules are in fact special cases of the generalized mean,

$$\begin{array}{rcl}{ \mu }_{c}\mathbf{(\mathit{x})} ={ \left ( \frac{1} {T}{\sum \nolimits }_{t=1}^{T}{\left ({d}_{ t,c}\mathbf{(\mathit{x})}\right )}^{\alpha }\right )}^{1/\alpha }& &\end{array}$$
(1.11)

where different choices of α lead to different combination rules. For example, α → −∞ leads to the minimum rule, and α → 0 leads to

$$\begin{array}{rcl}{ \mu }_{c}(x) ={ \left ({\prod \nolimits }_{t=1}^{T}\left ({d}_{ t,c}(x)\right )\right )}^{1/T}& &\end{array}$$
(1.12)

which is the geometric mean, a modified version of the product rule. For α = 1, we get the mean rule, and α → ∞ leads to the maximum rule.
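The sketch below illustrates how the generalized mean (1.11) recovers the minimum, geometric mean, mean, and maximum rules for different choices of α; the decision profile is hypothetical, and large finite values of α stand in for the ±∞ limits:

```python
# Sketch of the generalized mean combiner (1.11) and its special cases.
import numpy as np

def generalized_mean(DP: np.ndarray, alpha: float) -> np.ndarray:
    """DP: (T, C) decision profile; returns (C,) total class supports."""
    if alpha == 0:                            # limiting case: geometric mean
        return np.exp(np.log(DP).mean(axis=0))
    return np.mean(DP ** alpha, axis=0) ** (1.0 / alpha)

DP = np.array([[0.7, 0.2, 0.1],
               [0.6, 0.3, 0.1],
               [0.2, 0.5, 0.3]])
for alpha, name in [(-50, "~min"), (0, "geometric"), (1, "mean"), (50, "~max")]:
    mu = generalized_mean(DP, alpha)
    print(f"alpha={alpha:>3} ({name}): ensemble decision = class {mu.argmax()}")
```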

Decision Template: Consider computing the average decision profile observed for each class throughout training. Kuncheva defines this average decision profile as the decision template of that class [18]. We can then compare the decision profile of a given instance to the decision templates (i.e., average decision profiles) of each class, choosing the class whose decision template is closest to the decision profile of the current instance, in some similarity measure. The decision template for class ωc is then computed as

$$D{T}_{c} = \frac{1}{{N}_{c}}\sum_{\mathbf{x}\in {X}_{c}}DP(\mathbf{x})$$
(1.13)

as the average decision profile obtained from \(X_c\), the set of training instances (of cardinality \(N_c\)) whose true class is \(\omega_c\). Given an unlabeled test instance x, we first construct its decision profile DP(x) from the ensemble outputs and calculate the similarity S between DP(x) and the decision template \(DT_c\) for each class \(\omega_c\) as the degree of support given to class \(\omega_c\).

$$\begin{array}{rcl}{ \mu }_{c}\mathbf{(\mathit{x})} = S(DP(x),D{T}_{c}),c = 1,\ldots ,C& &\end{array}$$
(1.14)

where the similarity measure S is usually a squared Euclidean distance,

$$\begin{array}{rcl}{ \mu }_{c}\mathbf{(\mathit{x})} = 1 - \frac{1} {T \times C}{\sum \nolimits }_{t=1}^{T}{{\sum \nolimits }_{i=1}^{C}\left (D{T}_{ c}(t,i) - {d}_{t,i}\mathbf{(\mathit{x})}\right )}^{2}& &\end{array}$$
(1.15)

where \(DT_c(t, i)\) is the (t, i)th element of the decision template \(DT_c\), that is, the support given by the tth classifier to class \(\omega_i\), averaged over all training instances of class \(\omega_c\). We expect this support to be high when i = c, and low otherwise. The second term, \(d_{t,i}(x)\), is the support given by the tth classifier to class \(\omega_i\) for the given instance x. The class with the highest total support is then chosen as the ensemble decision.
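A compact sketch of decision-template combination, following (1.13)-(1.15), is given below on a hypothetical toy problem with two classifiers and two classes; the data generation is purely illustrative:

```python
# Sketch of decision templates: average the training decision profiles per
# class (1.13), then classify by similarity to each template (1.14)-(1.15).
import numpy as np

def fit_templates(profiles: np.ndarray, y: np.ndarray, n_classes: int):
    """profiles: (N, T, C) decision profiles of N training instances."""
    return np.stack([profiles[y == c].mean(axis=0) for c in range(n_classes)])

def dt_classify(dp: np.ndarray, templates: np.ndarray) -> int:
    """dp: (T, C) profile of a test instance; templates: (n_classes, T, C)."""
    T, C = dp.shape
    # (1.15): support = 1 - normalized squared Euclidean distance
    mu = 1 - ((templates - dp) ** 2).sum(axis=(1, 2)) / (T * C)
    return int(mu.argmax())

rng = np.random.default_rng(1)
y = np.array([0, 0, 0, 1, 1, 1])
profiles = np.clip(rng.normal([[0.8, 0.2], [0.7, 0.3]], 0.1, (6, 2, 2)), 0, 1)
profiles[y == 1] = 1 - profiles[y == 1]       # mirror profiles for class 1
templates = fit_templates(profiles, y, n_classes=2)
print(dt_classify(profiles[4], templates))    # -> 1
```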

1.3 Popular Ensemble-Based Algorithms

A rich collection of ensemble-based classifiers has been developed over the last several years. However, many of these are variations of a select few well-established algorithms whose capabilities have been extensively tested and widely reported. In this section, we present an overview of some of the most prominent ensemble algorithms.

1.3.1 Bagging

Breiman’s bagging (short for bootstrap aggregating) algorithm is one of the earliest and simplest, yet effective, ensemble-based algorithms. Given a training dataset S of cardinality N, bagging simply trains T independent classifiers, each trained by sampling, with replacement, N instances (or some percentage of N) from S. The diversity in the ensemble is ensured by the variations within the bootstrapped replicas on which each classifier is trained, as well as by using a relatively weak classifier whose decision boundaries measurably vary with respect to relatively small perturbations in the training data. Linear classifiers, such as decision stumps, linear SVMs, and single-layer perceptrons, are good candidates for this purpose. The classifiers so trained are then combined via simple majority voting. The pseudocode for bagging is provided in Algorithm 1.
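As an illustration (not the Algorithm 1 pseudocode itself), the following sketch implements bagging with scikit-learn decision stumps on a synthetic dataset; the ensemble size and dataset are arbitrary assumptions:

```python
# Sketch of bagging: T classifiers trained on bootstrap replicas of the data,
# combined by simple majority voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, random_state=0)
T, N = 25, len(X)

ensemble = []
for _ in range(T):
    idx = rng.integers(0, N, size=N)    # bootstrap: sample N with replacement
    ensemble.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

votes = np.stack([clf.predict(X) for clf in ensemble])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)   # simple majority vote
print(f"bagged training accuracy: {(y_pred == y).mean():.3f}")
```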

Bagging is best suited for problems with relatively small available training datasets. A variation of bagging, called Pasting Small Votes [42], is designed for problems with large training datasets and follows a similar approach, but partitions the large dataset into smaller segments. Individual classifiers are trained with these segments, called bites, before being combined via majority voting.

Another creative version of bagging is the Random Forest algorithm, essentially an ensemble of decision trees trained with a bagging mechanism [24]. In addition to choosing instances, however, a random forest can also incorporate random subset selection of features as described in Ho’s random subspace models [36].

1.3.2 Boosting and AdaBoost

Boosting, introduced in Schapire’s seminal work, The Strength of Weak Learnability [3], is an iterative approach for generating a strong classifier, one that is capable of achieving arbitrarily low training error, from an ensemble of weak classifiers, each of which can barely do better than random guessing. While boosting also combines an ensemble of weak classifiers using simple majority voting, it differs from bagging in one crucial way. In bagging, instances selected to train individual classifiers are bootstrapped replicas of the training data, which means that each instance has an equal chance of being in each training dataset. In boosting, however, the training dataset for each subsequent classifier increasingly focuses on instances misclassified by previously generated classifiers.

Boosting, designed for binary class problems, creates sets of three weak classifiers at a time: the first classifier (or hypothesis) \(h_1\) is trained on a random subset of the available training data, similar to bagging. The second classifier, \(h_2\), is trained on a different subset of the original dataset, precisely half of which is correctly identified by \(h_1\), and the other half misclassified. Such a training subset is said to be the “most informative,” given the decision of \(h_1\). The third classifier \(h_3\) is then trained with instances on which \(h_1\) and \(h_2\) disagree. These three classifiers are then combined through a three-way majority vote. Schapire proved that the training error of this three-classifier ensemble is bounded above by \(g(\epsilon) = 3{\epsilon}^{2} - 2{\epsilon}^{3}\), where ε is the error of any of the three classifiers, provided that each classifier has an error rate ε < 0.5, the least we can expect from a classifier on a binary classification problem.

AdaBoost (short for Adaptive Boosting) [4] and its several variations later extended the original boosting algorithm to multiple classes (AdaBoost.M1, AdaBoost.M2), as well as to regression problems (AdaBoost.R). Here we describe AdaBoost.M1, the most popular version of the AdaBoost algorithms.

AdaBoost has two fundamental differences from boosting: (1) instances are drawn into the subsequent datasets from an iteratively updated sample distribution of the training data; and (2) the classifiers are combined through weighted majority voting, where the voting weights are based on the classifiers’ training errors, which themselves are weighted according to the sample distribution. The sample distribution ensures that harder samples, that is, instances misclassified by the previous classifier, are more likely to be included in the training data of the next classifier.

The pseudocode of AdaBoost.M1 is provided in Algorithm 2. The sample distribution \(D_t(i)\) essentially assigns a weight to each training instance \(x_i\), i = 1, …, N, from which training data subsets \(S_t\) are drawn for each consecutive classifier (hypothesis) \(h_t\). The distribution is initialized to be uniform; hence, all instances have equal probability of being drawn into the first training dataset. The training error \(\epsilon_t\) of classifier \(h_t\) is then computed as the sum of the distribution weights of the instances misclassified by \(h_t\) ((1.17), where the indicator function ⟦·⟧ evaluates to 1 if its argument is true and 0 otherwise). AdaBoost.M1 requires that this error be less than 1/2, and normalizes it to obtain \(\beta_t\), such that \(0 < \beta_t < 1\) for \(0 < \epsilon_t < 1/2\).

The heart of AdaBoost.M1 is the distribution update rule shown in (1.19): the distribution weights of the instances correctly classified by the current hypothesis \(h_t\) are reduced by a factor of \(\beta_t\), whereas the weights of the misclassified instances are left unchanged. When the updated weights are renormalized by \(Z_t\) to ensure that \(D_{t+1}\) is a proper distribution, the weights of the misclassified instances are effectively increased. Hence, with each new classifier added to the ensemble, AdaBoost focuses on increasingly difficult instances. At each iteration t, (1.19) raises the weights of misclassified instances such that they add up to 1/2, and lowers those of correctly classified ones such that they too add up to 1/2. Since the base model learning algorithm BaseClassifier is required to have an error less than 1/2, it is guaranteed to correctly classify at least one previously misclassified training example. When it is unable to do so, AdaBoost aborts; otherwise, it continues until T classifiers are generated, which are then combined using weighted majority voting.

Note that the reciprocals of the normalized errors of individual classifiers are used as voting weights in the weighted majority voting of AdaBoost.M1; hence, classifiers that have shown good performance during training (low \(\beta_t\)) are rewarded with higher voting weights. Since the error of a classifier on its own training data can be very close to zero, \(\beta_t\) can be very small, making its reciprocal \(1/\beta_t\) quite large and causing numerical instabilities. Such instabilities are avoided by using the logarithm of \(1/\beta_t\) as the voting weight (1.20).
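The mechanics just described can be sketched compactly for the binary case; following common practice, explicit resampling from \(D_t\) is replaced here by the equivalent use of sample weights in the base learner, and the dataset and base learner are illustrative assumptions:

```python
# Sketch of AdaBoost.M1 (binary case) following (1.17)-(1.20).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
N, T = len(X), 20
D = np.full(N, 1.0 / N)                    # uniform initial distribution
hypotheses, log_weights = [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    miss = h.predict(X) != y
    eps = max(D[miss].sum(), 1e-10)        # (1.17): distribution-weighted error
    if eps >= 0.5:
        break                              # abort clause of AdaBoost.M1
    beta = eps / (1 - eps)                 # normalized error, 0 < beta < 1
    D[~miss] *= beta                       # (1.19): shrink correct instances,
    D /= D.sum()                           # so misclassified ones gain weight
    hypotheses.append(h)
    log_weights.append(np.log(1 / beta))   # (1.20): voting weight

votes = sum(w * np.where(h.predict(X) == 1, 1, -1)
            for h, w in zip(hypotheses, log_weights))
y_pred = (votes > 0).astype(int)           # weighted majority vote
print(f"AdaBoost.M1 training accuracy: {(y_pred == y).mean():.3f}")
```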

Much of the popularity of AdaBoost.M1 is due not only to its intuitive and extremely effective structure but also to Freund and Schapire’s elegant proof that shows the training error of AdaBoost.M1 to be bounded above by

$$\begin{array}{rcl}{ E}_{\mathrm{ensemble}} < {2}^{T}{ \prod }_{t=1}^{T}\sqrt{{\epsilon }_{ t}(1 - {\epsilon }_{t})}& &\end{array}$$
(1.16)

Since \(\epsilon_t < 1/2\), \(E_{\mathrm{ensemble}}\), the error of the ensemble, is guaranteed to decrease as the ensemble grows. It is interesting, however, to note that AdaBoost.M1 still requires the classifiers to have a (weighted) error that is less than 1/2 even on nonbinary class problems. Achieving this threshold becomes increasingly difficult as the number of classes increases. Freund and Schapire recognized that there is information even in the classifiers’ nonselected class outputs. For example, in a handwritten character recognition problem, the characters “1” and “7” look alike, and the classifier may give high support to both of these classes and low support to all others. AdaBoost.M2 takes advantage of the supports given to nonchosen classes by defining a pseudo-loss, which, unlike the error in AdaBoost.M1, is no longer required to be less than 1/2. Yet AdaBoost.M2 has a very similar upper bound for the training error as AdaBoost.M1. AdaBoost.R is another variation—designed for function approximation problems—that essentially replaces the classification error with a regression error [4].

1.3.3 Stacked Generalization

The algorithms described so far use nontrainable combiners, where the combination weights are established once the member classifiers are trained. Such a combination rule does not allow determining which member classifier has learned which partition of the feature space. Using trainable combiners, it is possible to determine which classifiers are likely to be successful in which part of the feature space and combine them accordingly. Specifically, the ensemble members can be combined using a separate classifier, trained on the outputs of the ensemble members, which leads to the stacked generalization model.

Wolpert’s stacked generalization [9], illustrated in Fig. 1.3, first creates T Tier-1 classifiers, \(C_1\), …, \(C_T\), based on a cross-validation partition of the training data. To do so, the entire training dataset is divided into B blocks, and each Tier-1 classifier is first trained on (a different set of) B − 1 blocks of the training data. Each classifier is then evaluated on the remaining (pseudo-test) block, not seen during training. The outputs of these classifiers on their pseudo-test blocks constitute the training data for the Tier-2 (meta) classifier, which effectively serves as the combination rule for the Tier-1 classifiers. Note that the meta-classifier is not trained on the original feature space, but rather on the decision space of the Tier-1 classifiers.

Fig. 1.3 Stacked generalization

Once the meta-classifier is trained, all Tier-1 classifiers (each of which has been trained B times on overlapping subsets of the original training data) are discarded, and each is retrained on the combined entire training data. The stacked generalization model is then ready to evaluate previously unseen field data.
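The two-tier procedure can be sketched with scikit-learn as follows; the Tier-1 models, the B = 5 block partition (obtained via cross-validated predictions), and the logistic-regression meta-classifier are all illustrative choices:

```python
# Sketch of stacked generalization: out-of-fold Tier-1 outputs become the
# training features of a Tier-2 meta-classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
tier1 = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]

# Pseudo-test (out-of-fold) predictions over B = 5 blocks -> meta-features
meta_X = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    for clf in tier1
])
meta = LogisticRegression().fit(meta_X, y)     # Tier-2 combiner

# As in the text, retrain each Tier-1 classifier on the entire training data
for clf in tier1:
    clf.fit(X, y)
field_X = np.column_stack([clf.predict_proba(X)[:, 1] for clf in tier1])
print(f"stacked training accuracy: {(meta.predict(field_X) == y).mean():.3f}")
```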

1.3.4 Mixture of Experts

The mixture of experts is a similar algorithm that also uses a trainable combiner. MoE trains an ensemble of (Tier-1) classifiers using a suitable sampling technique. The classifiers are then combined through a weighted combination rule, where the weights are determined by a gating network [7], which itself is typically trained using the expectation-maximization (EM) algorithm [8, 43] on the original training data. Hence, the weights determined by the gating network are dynamically assigned based on the given input, as the MoE effectively learns which portion of the feature space is learned by each ensemble member. Figure 1.4 illustrates the structure of the MoE algorithm.

Fig. 1.4 Mixture of experts model

Mixture of experts can also be seen as a classifier selection algorithm, where individual classifiers are trained to become experts in some portion of the feature space; hence, the ensemble members are usually not weak classifiers. The combination rule then selects the most appropriate classifier, or a set of classifiers weighted with respect to their expertise, for each given instance. The pooling/combining system may then choose a single classifier with the highest weight, or calculate a weighted sum of the classifier outputs for each class and pick the class that receives the highest weighted sum.
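The inference-time behavior of an MoE, with a gating network dynamically weighting expert outputs, can be sketched as follows; the gate parameters and the two hand-crafted “experts” are hypothetical stand-ins for trained models, and the EM-based training step is not shown:

```python
# Sketch of mixture-of-experts inference: input-dependent gating weights
# combine the experts' class supports.
import numpy as np

def gate(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Softmax gating network: one weight per expert for input x."""
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_predict(x, experts, W):
    weights = gate(x, W)                            # (T,) dynamic weights
    supports = np.stack([ex(x) for ex in experts])  # (T, C) expert outputs
    return int((weights[:, None] * supports).sum(axis=0).argmax())

# Two hypothetical "experts", each specialized on one half of a 2-D space
experts = [lambda x: np.array([0.9, 0.1]) if x[0] < 0 else np.array([0.5, 0.5]),
           lambda x: np.array([0.2, 0.8]) if x[0] >= 0 else np.array([0.5, 0.5])]
W = np.array([[-5.0, 0.0], [5.0, 0.0]])   # gate favors expert 0 when x[0] < 0
print(moe_predict(np.array([-1.0, 0.3]), experts, W))  # -> 0
print(moe_predict(np.array([+1.0, 0.3]), experts, W))  # -> 1
```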

1.4 What Else Can Ensemble Systems Do for You?

While ensemble systems were originally developed to reduce the variability in classifier decisions and thereby increase generalization performance, there are many additional problem domains where ensemble systems have proven to be extremely effective. In this section, we discuss some of these emerging applications of ensemble systems along with a family of algorithms, called Learn++, which are designed for these applications.

1.4.1 Incremental Learning

In many real-world applications, particularly those that generate large volumes of data, such data often become available in batches over a period of time. These applications need a mechanism to incorporate the additional data into the knowledge base in an incremental manner, preferably without needing access to the previous data. Formally speaking, incremental learning refers to sequentially updating a hypothesis using current data and previous hypotheses—but not previous data—such that the current hypothesis describes all data that have been acquired thus far. Incremental learning is associated with the well-known stability–plasticity dilemma, where stability refers to the algorithm’s ability to retain existing knowledge and plasticity refers to the algorithm’s ability to acquire new knowledge. Improving one usually comes at the expense of the other. For example, online data streaming algorithms usually have good plasticity but poor stability, whereas many of the well-established supervised algorithms, such as the MLP, SVM, and kNN, have good stability but poor plasticity.

Ensemble-based systems provide an intuitive approach for incremental learning that also provides a balanced solution to the stability–plasticity dilemma. Consider the AdaBoost algorithm, which directs the subsequent classifiers toward increasingly difficult instances. In an incremental learning setting, some of the instances introduced by the new batch can also be interpreted as “difficult” if they carry novel information. Therefore, an AdaBoost-like approach can be used in an incremental learning setting with certain modifications, such as creating a new ensemble with each batch that becomes available, resetting the sampling distribution based on the performance of the existing ensemble on the new batch of training data, and relaxing the abort clause. Note, however, that the distribution update rule in AdaBoost directs the sampling distribution toward those instances misclassified by the previous classifier. In an incremental learning setting, it is necessary to direct the algorithm to focus on those novel instances introduced by the new batch of data that have not yet been learned by the current ensemble, rather than by the previous classifier. The Learn++ algorithm, introduced in [44, 45], incorporates these ideas.

The incremental learning problem becomes particularly challenging if the new data also introduce new classes. This is because classifiers previously trained on earlier batches of data inevitably misclassify instances of the new class on which they were not trained; only the new classifiers are able to recognize the new class(es). Therefore, any decision by the new classifiers correctly choosing the new class is outvoted by the earlier classifiers, until there are enough new classifiers to counteract the total vote of those original classifiers. Hence, a relatively large number of new classifiers that recognize the new class are needed, so that their total weight can outweigh the incorrect votes of the original classifiers.

Learn++.NC (for New Classes), described in Algorithm 3, addresses these issues [46] by assigning dynamic weights to ensemble members, based on a prediction of which classifiers are likely to perform well on which classes. Learn++.NC cross-references the predictions of each classifier with those of the others, with respect to the classes on which each was trained. Looking at the decisions of the other classifiers, each classifier decides whether its decision is in line with the predictions of the others and the classes on which it was trained. If not, the classifier reduces its vote, or possibly refrains from voting altogether. As an example, consider an ensemble of classifiers, E1, trained with instances from two classes \(\omega_1\) and \(\omega_2\), and a second ensemble, E2, trained on instances from classes \(\omega_1\), \(\omega_2\), and a new class, \(\omega_3\). An instance from the new class \(\omega_3\) is shown to all classifiers. Since the E1 classifiers do not recognize class \(\omega_3\), they incorrectly choose \(\omega_1\) or \(\omega_2\), whereas the E2 classifiers correctly recognize \(\omega_3\). Learn++.NC keeps track of which classifiers are trained on which classes. In this example, knowing that E2 classifiers have seen \(\omega_3\) instances, and that E1 classifiers have not, it is reasonable to believe that the E2 classifiers are correct, particularly if they overwhelmingly choose \(\omega_3\) for that instance. To the extent the E2 classifiers are confident of their decision, the voting weights of the E1 classifiers can therefore be reduced. Then, E2 no longer needs a large number of classifiers: in fact, if the E2 classifiers agree with each other on their correct decision, then very few classifiers are sufficient to remove any bias induced by E1. This voting process, described in Algorithm 4, is called dynamically weighted consult-and-vote (DW-CAV) [46].

Specifically, Learn++.NC updates its sampling distribution based on the composite hypothesis H (1.25), which is the ensemble decision of all classifiers generated thus far. The composite hypothesis \(H_t^k\) of the first t classifiers from the kth batch is computed by weighted majority voting of all classifiers using the weights \(W_t^k\), which are themselves adjusted based on each classifier’s class-specific confidence \(P_c\) ((1.27) and (1.28)).

The class-specific confidence \(P_c(i)\) for instance \(x_i\) is the ratio of the total weight of all classifiers that choose class \(\omega_c\) for instance \(x_i\) to the total weight of all classifiers that have seen class \(\omega_c\). Hence, \(P_c(i)\) represents the collective confidence of classifiers trained on class \(\omega_c\) in choosing class \(\omega_c\) for instance \(x_i\). A high value of \(P_c(i)\), close to 1, indicates that classifiers trained to recognize class \(\omega_c\) have in fact overwhelmingly picked class \(\omega_c\), and hence those that were not trained on \(\omega_c\) should not vote (or should reduce their voting weight) for that instance.
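The computation of this class-specific confidence can be sketched as follows; the function and variable names are our own, and the surrounding DW-CAV weight-update loop of Algorithm 4 is not shown:

```python
# Sketch of the class-specific confidence P_c: weight of classifiers choosing
# class c, relative to the weight of classifiers trained on class c.
import numpy as np

def class_confidence(preds, weights, trained_on, c):
    """preds: (T,) predicted labels; weights: (T,) voting weights;
    trained_on: one set of known classes per classifier."""
    seen = np.array([c in s for s in trained_on])
    chose = preds == c
    return weights[chose].sum() / max(weights[seen].sum(), 1e-12)

# E1 = two classifiers trained on {0, 1}; E2 = two trained on {0, 1, 2}.
preds = np.array([0, 1, 2, 2])     # instance truly from the new class 2
weights = np.ones(4)
trained = [{0, 1}, {0, 1}, {0, 1, 2}, {0, 1, 2}]
print(class_confidence(preds, weights, trained, 2))   # -> 1.0
# All classifiers that know class 2 chose it, so the votes of the E1
# classifiers, never trained on class 2, can be discounted.
```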

Extensive experiments with Learn++.NC showed that the algorithm can very quickly learn new classes when they are present, and is in fact also able to remember a class when it is no longer present in future data batches [46].

1.4.2 Data Fusion

A common problem in many large-scale data analysis and automated decision making applications is to combine information from different data sources that often provide heterogeneous data. Diagnosing a disease from several blood or behavioral tests, imaging results, and time series data (such as EEG or ECG) is such an application. Detecting the health of a system or predicting weather patterns based on data from a variety of sensors, or the health of a company based on several sources of financial indicators are other examples of data fusion. In most data fusion applications, the data are heterogeneous, that is, they are of different format, dimensionality, or structure: some are scalar variables (such as blood pressure, temperature, humidity, speed), some are time series data (such as electrocardiogram, stock prices over a period of time, etc.), some are images (such as MRI or PET images, 3D visualizations, etc.).

Ensemble systems provide a naturally suited solution for such problems: individual classifiers (or even an ensemble of classifiers) can be trained on each data source and then combined through a suitable combiner. The stacked generalization and MoE structures are particularly well suited for data fusion applications. In both cases, each classifier (or even each ensemble of classifiers) can be trained on a separate data source. Then, a subsequent meta-classifier or a gating network can be trained to learn which models or experts have better prediction accuracy, or which ones have learned which parts of the feature space. Figure 1.5 illustrates this structure.

Fig. 1.5 Ensemble systems for data fusion

A comprehensive review of using ensemble-based systems for data fusion, as well as a detailed description of the Learn++ implementation for data fusion—shown to be quite successful on a variety of data fusion problems—can be found in [47]. Other ensemble-based fusion approaches include combining classifiers using Dempster–Shafer-based combination [48–50], ARTMAP [51], genetic algorithms [52], and other combinations of boosting/voting methods [53–55]. Using diversity metrics for ensemble-based data fusion is discussed in [56].

1.4.3 Feature Selection and Classifying with Missing Data

While most ensemble-based systems create individual classifiers by altering the training data instances—but keeping all features for a given instance—the feature set can also be altered while using all of the available training data. In such a setting, individual classifiers are trained with different subsets of the entire feature set. Algorithms that use different feature subsets are commonly referred to as random subspace methods (RSM), a term coined by Ho [36]. While Ho used this approach for creating random forests, the approach can also be used for feature selection as well as diversity enhancement.

Another interesting application of RSM-related methods is to use the ensemble approach to classify data that have missing features. Most classification algorithms have matrix multiplications that require the entire feature vector to be available. However, missing data is quite common in real-world applications: bad sensors, failed pixels, unanswered questions in surveys, malfunctioning equipment, medical tests that cannot be administered under certain conditions, etc. are all common scenarios in practice that can result in missing attributes. Feature values that are beyond the expected dynamic range of the data due to extreme noise, signal saturation, data corruption, etc. can also be treated as missing data.

Typical solutions to missing features include imputation algorithms, where the value of the missing variable is estimated based on other observed values of that variable. Imputation-based algorithms (such as expectation maximization, mean imputation, k-nearest neighbor imputation, etc.) are popular because they are theoretically justified and tractable; however, they are also prone to significant estimation errors, particularly for high-dimensional and/or noisy datasets.

An ensemble-based solution to this problem was offered in Learn++.MF [57] (MF for Missing Features), which generates a large number of classifiers, each of which is trained using only a random subset of the available features. The instance sampling distribution of the other Learn++ algorithms is replaced with a feature sampling distribution, which favors those features that have not been well represented in the previous classifiers’ feature sets. Then, a data instance with missing features is classified using the majority voting of only those classifiers whose feature sets did not include the missing attributes. This is conceptually illustrated in Fig. 1.6a, which shows 10 classifiers, each trained on three of the six features available in the dataset. Features that are not used during training are indicated with an “X.” Then, at the time of testing, let us assume that feature number 2, f2, is missing. This means that those classifiers whose training feature sets included f2, that is, classifiers C2, C5, C7, and C8, cannot be used in classifying this instance. However, the remaining classifiers, shaded in Fig. 1.6b, did not use f2 during their training, and therefore those classifiers can still be used.

Fig. 1.6 (a) Training classifiers with random subsets of the features; (b) classifying an instance missing feature f2. Only shaded classifiers can be used

Learn++.MF is listed in Algorithm 5 below. Perhaps the most important parameter of the algorithm is nof, the number of features (out of a total of f) used to train each classifier. Choosing a smaller nof allows a larger number of missing features to be accommodated by the algorithm; however, choosing a larger nof usually improves individual classifier performances. The primary assumption made by Learn++.MF is that the dataset includes a redundant set of features, so that the problem is at least partially solvable using a subset of the features, whose identities are unknown to us. Of course, if we knew the identities of those features, we would use only those features in the first place.
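The central voting idea, using only those classifiers whose feature subsets avoid the missing attributes, can be sketched as follows; the dataset, base learner, and parameters (f = 6, nof = 3, T = 10) are illustrative, and the feature sampling distribution of the full algorithm is simplified here to uniform random subsets:

```python
# Conceptual sketch of Learn++.MF-style voting with missing features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=0)
f, nof, T = X.shape[1], 3, 10

subsets = [rng.choice(f, size=nof, replace=False) for _ in range(T)]
classifiers = [DecisionTreeClassifier(max_depth=3, random_state=0)
               .fit(X[:, s], y) for s in subsets]

def predict_with_missing(x: np.ndarray, missing: set) -> int:
    usable = [(clf, s) for clf, s in zip(classifiers, subsets)
              if not missing & set(s.tolist())]  # subset avoids missing features
    if not usable:
        raise ValueError("no classifier avoids all missing features")
    votes = [int(clf.predict(x[s].reshape(1, -1))[0]) for clf, s in usable]
    return int(np.bincount(votes).argmax())      # majority vote of usable ones

print(predict_with_missing(X[0], missing={2}))   # classify without feature f2
```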

A theoretical analysis of this algorithm, including the probability of finding at least one usable classifier when m features are missing and each classifier is trained using nof of a total of f features, as well as the number of classifiers needed to guarantee at least one usable classifier, is provided in [57].

1.4.4 Learning from Nonstationary Environments: Concept Drift

Much of the computational intelligence literature is devoted to algorithms that can learn from data that are assumed to be drawn from a fixed but unknown distribution. For a great many applications, however, this assumption is simply not true. For example, predicting future weather patterns from current and past climate data, predicting future stock returns from current and past financial data, identifying e-mail spam from current and past e-mail content, determining which online ads a user will respond to based on the user’s past web surfing record, and predicting future energy demand and prices based on current and past data are all examples of applications where the nature and characteristics of the data—and the underlying phenomena that generate such data—may change over time. Therefore, a learning model trained at a fixed point in time—and a decision boundary generated by such a model—may not reflect the current state of nature due to a change in the underlying environment. Such an environment is referred to as a nonstationary environment, and the problem of learning in such an environment is often referred to as learning concept drift. More specifically, given the Bayes posterior probability of class ω to which a given instance x belongs, \(P(\omega \vert x) = P(x\vert \omega )P(\omega )/P(x)\), concept drift can be formally defined as any scenario where the posterior probability changes over time, that is, \({P}_{t+1}(\omega \vert x)\neq {P}_{t}(\omega \vert x)\).

To be sure, this is a very challenging problem in machine learning because the underlying change may be gradual or rapid, cyclical or noncyclical, systematic or random, with a fixed or variable rate of drift, and with local or global activity in the feature space that spans the data. Furthermore, concept drift can also be perceived, rather than real, as a result of insufficient, unknown, or unobservable features in a dataset, a phenomenon known as hidden context [58]. In such a case, an underlying phenomenon provides a true and static description of the environment over time, which, unfortunately, is hidden from the learner’s view. Having the benefit of knowing this hidden context would render the distribution fixed (and hence stationary).

Concept drift problems are usually associated with incremental learning or learning from a stream of data, where new data become available over time. Combining several authors’ suggestions for the desired properties of a concept drift algorithm, Elwell and Polikar provided the following guidelines for addressing concept drift problems: (1) any given instance of data—whether provided online or in batches—can only be used once for training (one-pass incremental learning); (2) knowledge should be labeled with respect to its relevance to the current environment, and be dynamically updated as new data continuously arrive; (3) the learner should have a mechanism to reconcile existing and newly acquired knowledge when they conflict with each other; (4) the learner should be able not only to temporarily forget information that is no longer relevant to the current environment, but also to recall prior knowledge if the drift/change in the environment follows a cyclical nature; and (5) knowledge should be incrementally and periodically stored so that it can be recalled to produce the best hypothesis for an unknown (unlabeled) data instance at any time during the learning process [59].

The earliest examples of concept drift algorithms use a single classifier to learn from the latest batch of data available, using some form of windowing to control the batch size. Successful examples of this instance selection approach include the STAGGER [60] and FLORA [58] algorithms, which use a sliding window to choose a block of (new) instances to train a new classifier. The window size can be dynamically updated using a “window adjustment heuristic,” based on how fast the environment is changing. Instances that fall outside of the window are then assumed irrelevant, and the information carried by them is irrecoverably forgotten. Other examples of this window-based approach include [61–63], which use different drift detection mechanisms or base classifiers. Such approaches are often either not truly incremental, as they may access prior data, or cannot handle cyclic environments. Some approaches include novelty (anomaly) detection to determine the precise moment when changes occur, typically by using statistical measures, such as control-chart-based CUSUM [64, 65], confidence intervals on error [66, 67], or other statistical approaches [68]. A new classifier trained on the new data acquired since the last detection of change then replaces the earlier classifier(s).

Ensemble-based algorithms provide an alternative approach to concept drift problems. These algorithms generally belong to one of three categories [69]: (1) update the combination rules or voting weights of a fixed ensemble, such as [70, 71], an approach loosely based on Littlestone's Winnow [72] and Freund and Schapire's Hedge (a precursor of AdaBoost) [4]; (2) update the parameters of existing ensemble members using an online learner [66, 73]; and/or (3) add new members to build an ensemble with each incoming dataset. Most algorithms fall into this last category, where the oldest (e.g., Streaming Ensemble Algorithm (SEA) [74] or Recursive Ensemble Approach (REA) [75]) or the least contributing ensemble members are replaced with new ones (as in Dynamic Integration [76] or Dynamic Weighted Majority (DWM) [77]). While many ensemble approaches use some form of voting, there is some disagreement on whether the voting should be weighted, e.g., giving higher weight to a classifier if its training data were in the same region as the testing example [76], or unweighted, as in [78, 79], where the authors argue that weights based on previous data, whose distribution may have changed, are uninformative for future datasets. Other efforts that combine ensemble systems with drift detection include Bifet's adaptive sliding window (ADWIN) [80, 81], also available within the WEKA-like software suite Massive Online Analysis (MOA) [82].
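A minimal sketch in the spirit of the third category is shown below: a fixed-size ensemble grows by one classifier per batch and, once full, replaces its least contributing member. This is an illustration of the idea rather than the published SEA or DWM pseudocode, and class labels are assumed to be nonnegative integers.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

class BatchEnsemble:
    """One new classifier per batch; the weakest member is dropped when full."""
    def __init__(self, max_size=10, base=None):
        self.max_size = max_size
        self.base = base if base is not None else DecisionTreeClassifier()
        self.members = []

    def partial_fit(self, X, y):
        new = clone(self.base).fit(X, y)
        if len(self.members) >= self.max_size:
            # drop the member that contributes least on the newest batch
            scores = [m.score(X, y) for m in self.members]
            self.members.pop(int(np.argmin(scores)))
        self.members.append(new)

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.members])
        # unweighted majority vote over integer class labels
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)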

More recently, a new addition to the Learn \({}^{++}\) suite of algorithms, Learn \({}^{++}\).NSE, has been introduced as a general framework for learning concept drift that places no restriction on the nature of the drift. Learn \({}^{++}\).NSE (for NonStationary Environments) inherits the dynamic distribution-guided ensemble structure and incremental learning abilities of all Learn \({}^{++}\) algorithms (hence it strictly follows the one-pass rule). Learn \({}^{++}\).NSE trains a new classifier for each batch of data it receives, and combines the classifiers using dynamically weighted majority voting. The novelty of the approach lies in how the voting weights are determined: each classifier's time-adjusted accuracy on current and past environments allows the algorithm to recognize changes in the underlying data distributions, including the possible recurrence of an earlier distribution, and to act accordingly [59].

The Learn \({}^{++}\).NSE algorithm is listed in Algorithm 6. It receives the training dataset \({\mathrm{D}}^{t} = \left \{{x}_{i}^{t} \in X;{y}_{i}^{t} \in Y \right \},\quad i = 1,...,{m}^{t}\), at time t. Hence, \(x_i^t\) is the ith instance of the dataset, drawn from an unknown distribution \(P^t(x,y)\), which is the currently available representation of a possibly drifting distribution at time t. At time t + 1, a new batch of data is drawn from \(P^{t+1}(x,y)\). Between any two consecutive batches, the environment may experience a change whose rate is neither known nor assumed to be constant. Previously seen data are not available to the algorithm, allowing Learn \({}^{++}\).NSE to operate in a truly incremental fashion.

The algorithm is initialized with a single classifier trained on the first batch of data. With the arrival of each subsequent batch, the current ensemble \(H^{t-1}\) (the composite hypothesis of all individual hypotheses previously generated) is first evaluated on the new data (Step 1 in Algorithm 6). In Step 2, the algorithm identifies those examples of the new environment that are not recognized by the existing ensemble \(H^{t-1}\), and updates the penalty distribution \(D^t\). This distribution is used not for instance selection, but rather to assign penalties to classifiers based on their ability to identify previously seen or unseen instances. A new classifier \(h_t\) is then trained on the current training data in Step 3. In Step 4, each classifier generated thus far is evaluated on the training data, weighted with respect to the penalty distribution. Note that since classifiers are generated at different times, each classifier receives a different number of evaluations: at time t, \(h_t\) receives its first evaluation, whereas \(h_1\) is evaluated for the tth time. We use \(\epsilon_k^t\), k = 1, ..., t, to denote the error of \(h_k\) (the classifier generated at time step k) on dataset \(D^t\). Higher weight is given to classifiers that correctly identify previously unknown instances, while classifiers that misclassify previously known data are penalized. Note that if the newest classifier has a weighted error of 1/2 or greater, i.e., if \(\epsilon_{k=t}^{t} \geq 1/2\), it is discarded and replaced with a new classifier. Older classifiers with error \(\epsilon_{k<t}^{t} \geq 1/2\), however, are retained but have their error saturated at 1/2 (which later corresponds to a zero vote on that environment). The errors are then normalized, creating \(\beta_k^t\) values that fall in the [0, 1] range.
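As a rough sketch of Steps 1 and 2, following the description above rather than reproducing the published pseudocode, the penalty distribution can be computed from the composite hypothesis's predictions on the new batch. The function name and the guard constant are illustrative.

import numpy as np

def penalty_distribution(H_pred, y):
    """H_pred: composite-hypothesis predictions on the new batch; y: true labels."""
    E = max(np.mean(H_pred != y), 1e-3)   # ensemble error on the new data (Step 1);
                                          # guarded so a perfect ensemble does not zero out D^t
    w = np.where(H_pred == y, E, 1.0)     # correctly classified instances are down-weighted
    return w / w.sum()                    # normalized penalty distribution D^t (Step 2)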

In Step 5, classifier error is further weighted (using a sigmoid function) with respect to time, so that recent competence (error rate) is weighed more heavily. Such sigmoid-based weighted averaging also serves to smooth out potentially large swings in classifier errors that may be due to noisy data rather than actual drift. Final voting weights are determined in Step 6 as the log-normalized reciprocals of the weighted errors: if a classifier performs poorly on the current environment, it receives little or no weight and is effectively, but only temporarily, removed from the ensemble. The classifier is not discarded, however; it is recalled, through the assignment of higher voting weights, if it performs well on future environments. Learn \({}^{++}\).NSE forgets only temporarily, which is particularly useful in cyclical environments. The final decision is obtained in Step 7 as the weighted majority vote of the current ensemble members.
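The sketch below illustrates the weight computation of Steps 5 and 6 as described in the text: each classifier's history of normalized errors is averaged under a sigmoid that emphasizes recent batches, and the voting weight is the log of the reciprocal of that average. The sigmoid parameters a and b are free parameters here, and the code follows the textual description rather than the published pseudocode.

import numpy as np

def nse_voting_weights(beta, a=0.5, b=10):
    """beta[k] is the array [beta_k^k, ..., beta_k^t] of normalized errors of
    classifier h_k on every batch it has seen; returns one weight per classifier."""
    weights = []
    for bk in beta:
        ages = np.arange(len(bk))                      # 0 = batch on which h_k was created
        omega = 1.0 / (1.0 + np.exp(-a * (ages - b)))  # sigmoid: recent errors count more (Step 5)
        omega /= omega.sum()
        beta_bar = float(np.dot(omega, bk))            # time-weighted average error
        weights.append(np.log(1.0 / max(beta_bar, 1e-10)))  # Step 6: log of the reciprocal
    return np.array(weights)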

Learn \({}^{++}\).NSE has been evaluated and benchmarked against other algorithms, on a broad spectrum of real-world as well as carefully designed synthetic datasets—including gradual and rapid drift, variable rate of drift, cyclical environments, as well as environments that introduce or remove concepts. These experiments and their results are reported in [59], which shows that the algorithm can serve as a general framework for learning concept drift regardless of the environment that characterizes the drift.

1.4.5 Confidence Estimation

In addition to the various machine learning problems described above, ensemble systems can also be used to address other challenges that are difficult or impossible to solve with a single-classifier system.

One such application is determining the confidence of the (ensemble-based) classifier in its own decision. The idea is extremely intuitive, as it directly follows the use of ensemble systems in our daily lives. Consider reading user reviews of a particular product, or consulting the opinions of several physicians on the risks of a particular medical procedure. If all, or at least most, users agree in their opinion that the product reviewed is very good, we would have higher confidence in our decision to purchase that item. Similarly, if all physicians agree on the effectiveness of a particular medical operation, then we would feel more comfortable with that procedure. On the other hand, if some of the reviews are highly complimentary whereas others are highly critical, that casts doubt on our decision to purchase the item. Of course, in order for our confidence in the “ensemble of reviewers” to be valid, we must believe that the reviewers are independent of each other, and have indeed reviewed the items independently. If certain reviewers had written their reviews based on other reviewers' reviews, the confidence based on the ensemble would become meaningless.

This idea can be naturally extended to classifiers. If a considerable majority of the classifiers in an ensemble agree on their decisions, then we can interpret that outcome as the ensemble having higher confidence in its decision, as opposed to a mere majority of classifiers choosing a particular class. In fact, under certain conditions, the consistency of the classifier outputs can also be used to estimate the true posterior probability of each class [28]. Of course, as in the examples given above, the classifier decisions must be independent for this confidence (and the estimated posterior probabilities) to be meaningful.
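A minimal sketch of this agreement-based confidence is given below, assuming independent members and integer class labels; reading the winning class's vote share as a posterior estimate is only valid under the conditions discussed in [28].

import numpy as np

def decision_with_confidence(votes):
    """votes: 1-D integer array of class labels predicted by the ensemble members."""
    counts = np.bincount(votes)
    decision = counts.argmax()
    confidence = counts[decision] / votes.size   # fraction of members that agree
    return decision, confidence

# For example, 9 of 10 members agreeing yields confidence 0.9,
# whereas a mere 6-of-10 majority yields only 0.6.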

1.5 Summary

Ensemble-based systems provide intuitive, simple, elegant, and powerful solutions to a variety of machine learning problems. Originally developed to improve classification accuracy by reducing the variance in classifier outputs, ensemble-based systems have since proven to be very effective in a number of problem domains that are difficult to address using a single model-based system.

A typical ensemble-based system consists of three components: a mechanism to choose instances (or features), which adds to the diversity of the ensemble; a mechanism for training the component classifiers of the ensemble; and a mechanism to combine the classifiers. The selection of instances can be done either completely at random, as in bagging, or by following a strategy implemented through a dynamically updated distribution, as in the boosting family of algorithms. In general, most ensemble-based systems are independent of the type of base classifier used to create the ensemble, a significant advantage that allows using the specific type of classifier known to be best suited for a given application. In that sense, ensemble-based systems are also known as algorithm-free algorithms.

Finally, a number of different strategies can be used to combine the classifiers, though the sum rule, simple majority voting, and weighted majority voting are the most commonly used, due to certain theoretical guarantees they provide.

We also discussed a number of problem domains in which ensemble systems can be used effectively. These include incremental learning from additional data, feature selection, addressing missing features, data fusion, and learning from nonstationary data distributions. Each of these areas has several algorithms developed to address its specific issues, which are summarized in this chapter. We also described a suite of algorithms, collectively known as the Learn \({}^{++}\) family, that is capable of addressing all of these problems with proper modifications to the base approach: all Learn \({}^{++}\) algorithms are incremental algorithms that build an ensemble of classifiers trained on the current data only, combined through majority voting. The individual members of the Learn \({}^{++}\) family differ from each other in their particular distribution update rules, along with a creative weight assignment that is specific to the problem.