Synonyms

Classifier combination; Committee-based learning; Multiple classifier systems

Definition

Ensemble learning is a machine learning paradigm in which multiple learners are trained to solve the same problem. In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them for use.

Introduction

An ensemble contains a number of learners, which are usually called base learners. The generalization ability of an ensemble is usually much stronger than that of the base learners. Indeed, ensemble learning is appealing because it is able to boost weak learners, which are only slightly better than random guessing, into strong learners that can make very accurate predictions; for this reason, “base learners” are also referred to as “weak learners.” It is noteworthy, however, that although most theoretical analyses work with weak learners, the base learners used in practice are not necessarily weak, since using not-so-weak base learners often results in better performance.

Base learners are usually generated from training data by a base learning algorithm, which can be a decision tree, a neural network, or another kind of machine learning algorithm. Most ensemble methods use a single base learning algorithm to produce homogeneous base learners, but some methods use multiple learning algorithms to produce heterogeneous learners. In the latter case there is no single base learning algorithm, and thus some people prefer the terms “individual learners” or “component learners” to “base learners,” although these names can also be used for homogeneous base learners.

It is difficult to trace the starting point of the history of ensemble methods, since the basic idea of deploying multiple models has been in use for a long time. It is clear, however, that the wave of research on ensemble learning since the 1990s owes much to two works. The first is an empirical study conducted by Hansen and Salamon [1] at the end of the 1980s, where they found that predictions made by combining a set of classifiers are often more accurate than predictions made by the best single classifier. The second is a theoretical study conducted in 1989, where Schapire [2] proved that weak learners can be boosted into strong learners; the proof resulted in boosting, one of the most influential ensemble methods.

Constructing Ensembles

Typically, an ensemble is constructed in two steps. First, a number of base learners are produced, either in a parallel style or in a sequential style where the generation of a base learner influences the generation of subsequent learners. Then, the base learners are combined for use; among the most popular combination schemes are majority voting for classification and weighted averaging for regression.
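
As a concrete illustration of the combination step, here is a minimal sketch of the two schemes mentioned above, assuming NumPy is available; the function names and the convention that predictions are stacked one row per base learner are choices made for this sketch only.

```python
import numpy as np

def majority_vote(predictions):
    """Combine class labels (one row per base learner) by majority voting."""
    predictions = np.asarray(predictions)           # shape: (n_learners, n_samples)
    combined = []
    for column in predictions.T:                    # iterate over samples
        labels, counts = np.unique(column, return_counts=True)
        combined.append(labels[np.argmax(counts)])  # most frequent label wins
    return np.array(combined)

def weighted_average(predictions, weights):
    """Combine real-valued outputs by a weighted average of the base learners."""
    predictions = np.asarray(predictions, dtype=float)   # (n_learners, n_samples)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                    # normalize the learner weights
    return weights @ predictions                         # shape: (n_samples,)
```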

Generally, to get a good ensemble, the base learners should be as accurate as possible and as diverse as possible. This has been formally shown by Krogh and Vedelsby [3] and emphasized by many other researchers. There are many effective processes for estimating the accuracy of learners, such as cross-validation and hold-out testing. However, there is no rigorous definition of what is intuitively perceived as diversity. Although a number of diversity measures have been designed, Kuncheva and Whitaker [4] showed that the usefulness of existing diversity measures in constructing ensembles is questionable. In practice, diversity among the base learners can be introduced through different channels, such as subsampling the training examples, manipulating the attributes, manipulating the outputs, injecting randomness into the learning algorithm, or even using several of these mechanisms simultaneously. The employment of different base learner generation processes and/or different combination schemes leads to different ensemble methods.
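
To make two of these channels concrete, the sketch below generates training "views" that differ both in the examples (bootstrap subsampling) and in the attributes (random subsets, in the spirit of attribute manipulation). The function name, the feature fraction, and the generator-based interface are illustrative assumptions, not part of any particular published method.

```python
import numpy as np

def diverse_views(X, y, n_views, feature_fraction=0.7, seed=0):
    """Yield training views that differ in both the examples and the attributes."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    k = max(1, int(feature_fraction * d))
    for _ in range(n_views):
        rows = rng.integers(0, m, size=m)            # subsample examples, with replacement
        cols = rng.choice(d, size=k, replace=False)  # keep a random subset of the attributes
        # cols must be remembered so the same attributes can be selected at prediction time
        yield X[rows][:, cols], y[rows], cols
```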

There are many effective ensemble methods. The following will briefly introduce three representative methods: boosting [2, 5], bagging [6] and stacking [7]. Here, binary classification is considered for simplicity. That is, let \(\mathcal{X}\) and \(\mathcal{Y}\) denote the instance space and the set of class labels, respectively, assuming \(\mathcal{Y} = \left \{-1,+1\right \}\). A training data set \(\mathcal{D} = \left \{\left (\mathbf{x}_{1},y_{1}\right ),\left (\mathbf{x}_{2},y_{2}\right ),\cdots \,,\left (\mathbf{x}_{m},y_{m}\right )\right \}\) is given, where \(\mathbf{x}_{i} \in \mathcal{X}\) and \(y_{i} \in \mathcal{Y}\left (i = 1,\ldots ,m\right )\).

Boosting is in fact a family of algorithms, since there are many variants. Here, the most famous one, AdaBoost [5], is considered as an example. First, it assigns equal weights to all the training examples. Denote the distribution of the weights at the \(t\)-th learning round as \(D_{t}\). From the training data set and \(D_{t}\), the algorithm generates a base learner \(h_{t} : \mathcal{X} \rightarrow \mathcal{Y}\) by calling the base learning algorithm. Then, it uses the training examples to test \(h_{t}\), and the weights of the incorrectly classified examples are increased. Thus, an updated weight distribution \(D_{t+1}\) is obtained. From the training data set and \(D_{t+1}\), AdaBoost generates another base learner by calling the base learning algorithm again. Such a process is repeated \(T\) times, each repetition being called a round, and the final learner is derived by weighted majority voting of the \(T\) base learners, where the weights of the learners are determined during the training process. In practice, the base learning algorithm may be able to use weighted training examples directly; otherwise, the weights can be exploited by sampling the training examples according to the weight distribution \(D_{t}\). The pseudo-code of AdaBoost is shown in Fig. 1.

Ensemble Learning, Fig. 1: The AdaBoost algorithm
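
To complement the pseudo-code, the following is a minimal runnable sketch of AdaBoost under the binary setting \(\mathcal{Y} = \{-1,+1\}\) used above, assuming scikit-learn decision stumps as the base learning algorithm; the early-stopping rule and the handling of a zero-error learner are simplifications made for this sketch rather than part of the original algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    """Train AdaBoost for T rounds; labels y must be in {-1, +1}."""
    X, y = np.asarray(X), np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)                      # initial weight distribution D_1
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)  # base learning algorithm (a decision stump)
        h.fit(X, y, sample_weight=D)             # it can use weighted examples directly
        pred = h.predict(X)
        eps = D[pred != y].sum()                 # weighted training error of h_t
        if eps == 0.0:                           # perfect learner: keep it and stop
            learners.append(h); alphas.append(1.0)
            break
        if eps >= 0.5:                           # no better than random guessing: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # weight of h_t in the final vote
        D = D * np.exp(-alpha * y * pred)        # increase weights of misclassified examples
        D = D / D.sum()                          # normalize to obtain D_{t+1}
        learners.append(h); alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(learners, alphas, X):
    """Final learner: weighted majority vote of the base learners."""
    score = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(score)
```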

Bagging [6] trains a number of base learners, each from a different bootstrap sample, by calling the base learning algorithm. A bootstrap sample is obtained by sampling the training data set with replacement, where the size of the sample is the same as that of the training data set. Thus, in a bootstrap sample some training examples may appear multiple times while others may not appear at all; the probability that a given example appears at least once is \(1 - (1 - 1/m)^{m} \approx 1 - 1/e \approx 0.632\). After obtaining the base learners, bagging combines them by majority voting, and the most-voted class is predicted. The pseudo-code of bagging is shown in Fig. 2. It is worth mentioning that a variant of bagging, Random Forests [8], has been deemed one of the most powerful ensemble methods to date.

Ensemble Learning, Fig. 2: The bagging algorithm
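
The following is a minimal sketch of bagging under the same binary setting, again assuming scikit-learn decision trees as the base learning algorithm; the function names, the ensemble size, and the sign-based tie-breaking are assumptions of this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=50, seed=0):
    """Train T base learners, each on a bootstrap sample of the training set."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    m = len(y)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)   # sample m examples with replacement
        h = DecisionTreeClassifier()       # base learning algorithm (an unpruned tree here)
        h.fit(X[idx], y[idx])
        learners.append(h)
    return learners

def bagging_predict(learners, X):
    """Majority voting over {-1, +1} labels: the most-voted class is predicted."""
    votes = np.stack([h.predict(X) for h in learners])   # shape (T, n_samples)
    return np.sign(votes.sum(axis=0) + 1e-9)             # ties broken toward +1
```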

In a typical implementation of stacking [7], a number of first-level individual learners are generated from the training data set by employing different learning algorithms. These individual learners are then combined by a second-level learner, which is called the meta-learner. The pseudo-code of stacking is shown in Fig. 3. It is evident that stacking is closely related to information fusion methods.

Ensemble Learning, Fig. 3: The stacking algorithm
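
A minimal sketch of such a typical implementation follows, assuming scikit-learn is available; the particular first-level algorithms (a decision tree, k-nearest neighbors, and naive Bayes), the logistic-regression meta-learner, and the use of five-fold out-of-fold predictions as meta-features are illustrative choices, not requirements of stacking.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def stacking_train(X, y):
    """First-level learners from different algorithms, combined by a meta-learner."""
    first_level = [DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()]
    # Meta-features are out-of-fold predictions, so the meta-learner never sees
    # outputs produced on examples a first-level learner was trained on.
    meta_X = np.column_stack([cross_val_predict(h, X, y, cv=5) for h in first_level])
    for h in first_level:                        # refit each first-level learner on all data
        h.fit(X, y)
    meta_learner = LogisticRegression().fit(meta_X, y)
    return first_level, meta_learner

def stacking_predict(first_level, meta_learner, X):
    """Feed the first-level predictions to the meta-learner."""
    meta_X = np.column_stack([h.predict(X) for h in first_level])
    return meta_learner.predict(meta_X)
```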

Generally speaking, no ensemble method consistently outperforms all other ensemble methods. Empirical studies on popular ensemble methods can be found in many papers, such as [9–11]. It was previously thought that using more base learners would lead to better performance, yet Zhou et al. [12] proved the “many could be better than all” theorem, which indicates that this may not be the case. It was shown that, after generating a set of base learners, selecting some of them instead of using all of them to compose the ensemble can be a better choice. Such ensembles are called selective ensembles.

It is worth mentioning that in addition to classification and regression, ensemble methods have also been designed for clustering [13] and other kinds of machine learning tasks.

Why Ensembles Are Superior to Singles

To understand why the generalization ability of an ensemble is usually much stronger than that of a single learner, Dietterich [14] gave three reasons by viewing the nature of machine learning as searching a hypothesis space for the most accurate hypothesis. The first reason is that the training data might not provide sufficient information for choosing a single best learner; for example, there may be many learners that perform equally well on the training data set, and combining these learners may be a better choice. The second reason is that the search processes of the learning algorithms might be imperfect; for example, even if a unique best hypothesis exists, it might be difficult to reach, since running the algorithms results in suboptimal hypotheses, and ensembles can compensate for such imperfect search processes. The third reason is that the hypothesis space being searched might not contain the true target function, whereas an ensemble can give a good approximation. For example, it is well known that the classification boundaries of decision trees are linear segments parallel to the coordinate axes; if the target classification boundary is a smooth diagonal line, a single decision tree cannot produce a good result, yet a good approximation can be achieved by combining a set of decision trees. Note that these are intuitive rather than rigorous theoretical explanations.
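
The third reason can be checked with a small, hedged experiment: a diagonal target boundary is approximated by a single axis-parallel tree and by a combination of such trees. The data generator, tree depth, and ensemble size below are arbitrary choices for illustration, and the ensemble's score is typically, though not always, the higher of the two.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.where(X[:, 0] > X[:, 1], 1, -1)          # smooth diagonal target boundary
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

single = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
ensemble = BaggingClassifier(DecisionTreeClassifier(max_depth=4),
                             n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree  :", single.score(X_test, y_test))
print("tree ensemble:", ensemble.score(X_test, y_test))
```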

There are many theoretical studies on famous ensemble methods such as boosting and bagging, yet we are still far from a clear understanding of the underlying mechanisms of these methods. For example, empirical observations show that boosting often does not suffer from overfitting even after a large number of rounds, and it is sometimes even able to reduce the generalization error after the training error has already reached zero. Although many researchers have studied this phenomenon, the theoretical explanations are still under debate.

The bias-variance decomposition is often used in studying the performance of ensemble methods [9, 12]. It is known that bagging can significantly reduce the variance and is therefore best applied to learners that suffer from high variance, i.e., unstable learners such as decision trees or neural networks. Boosting can significantly reduce the bias in addition to reducing the variance, and therefore, on weak learners such as decision stumps, boosting is usually more effective.
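
As a hedged illustration of this rule of thumb, the snippet below compares scikit-learn's stock implementations (assumed available) on a synthetic task: by default, BaggingClassifier uses unpruned decision trees while AdaBoostClassifier uses decision stumps; the dataset parameters are arbitrary and the exact scores will vary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification task (sizes chosen arbitrarily for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagging = BaggingClassifier(n_estimators=100, random_state=0)    # full trees: high variance
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)  # stumps: high bias

print("bagging  (trees) :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting (stumps):", cross_val_score(boosting, X, y, cv=5).mean())
```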

Applications

Ensemble learning has already been used in diverse applications such as optical character recognition, text categorization, face recognition, computer-aided medical diagnosis, gene expression analysis, etc. Actually, ensemble learning can be used wherever machine learning techniques can be used.

Summary

Ensemble learning is a powerful machine learning paradigm which has exhibited clear advantages in many applications. By using multiple learners, the generalization ability of an ensemble can be much better than that of a single learner. A serious deficiency of current ensemble methods is their lack of comprehensibility, i.e., the knowledge learned by ensembles is not understandable to the user. Improving the comprehensibility of ensembles [15] is an important yet largely understudied direction. Another important issue is that no existing diversity measure is satisfactory [4], although it is known that diversity plays an important role in ensembles. If these issues can be addressed well, ensemble learning will be able to contribute to even more applications.

Related Entries