
1 Introduction

By combining the predictions of multiple diverse classifiers, an ensemble of classifiers can perform better than any single classifier. We propose a novel ensemble classifier that classifies unseen data more accurately than the current state-of-the-art. The proposed algorithm f takes some \(n\times m\) data X as input, where the rows \(x\in X\) represent independently and identically distributed samples from some universe \(\mathcal {X}\), each with a label y from some output space Y that denotes the category x belongs to. f is trained on X, learning the underlying patterns in the data that determine which category \(y\in Y\) any particular datum x belongs to. The aim of ensemble classification is then to use the trained model to predict the label y of unseen data \(z\in Z\), which comes from the same universe \(\mathcal {X}\) as X [9]. The columns of X represent features A that describe the properties of each row (i.e., record) x, and it is these features that a classification algorithm uses to learn how to classify unseen data [9]. Examples of classification algorithms include support vector machines [28] and decision trees [21]. By training multiple classifiers, and then combining the predictions made by each classifier into a final overall prediction, an ensemble of classifiers can outperform any individual classifier [8, 25].

1.1 Contributions

We propose a generalized ensemble classification algorithm that uses data diversity and base-classifier diversity to build a decision model with low classification error. Our novel contributions are:

  • We maximize data diversity by creating subsets of data with large differences in distribution, using k-Means clustering for a large range of k values, \(k=1,\ldots ,K\). We ensure that each cluster has sufficient data for training the base-classifiers by bounding K such that, when there are k clusters, each cluster contains on average at least \(k^2\) records.

  • During the incremental clustering process, we compare the new clusters to all previous clusters, and remove (i.e., prune) the new cluster if it is sufficiently similar to a previous cluster. This prevents the prediction voting process from being biased by repetitious votes.

  • When classifying a new record z, rather than inputting it into all base-classifiers, we only input the record into the classifiers built using the K clusters that are closest to the new record. This is done regardless of which clusterings the K clusters came from.

After presenting related work in Sect. 2, we introduce the proposed approach in Sect. 3, exploring each component in separate subsections. We then empirically test the proposed algorithm in Sect. 4, before concluding the paper in Sect. 5.

2 Related Work

The notion of “diversity” when building ensemble classifiers has been researched extensively [1, 6, 14, 24]. Despite being difficult to define precisely [6], the overall concept is straightforward: if all the base-classifiers in an ensemble make the same predictions, they are also making the same mistakes, and if they are all making the same mistakes, there is no advantage in having more than one of them. By diversifying the predictions that the base-classifiers make, the ensemble can become greater than the sum of its parts. There are several types of diversity that an ensemble algorithm can achieve.

Data diversity is achieved by sampling subsets of data from an original dataset, in a way that causes the predictions made by classifiers trained on those subsets to differ from one another. This can be achieved by selecting either records or features (or both) from the original dataset. Duplicating records or features across the new sets of data is viable (as in bagging [2] or random feature subspaces [29]), as is creating mutually exclusive sets (as in clustering [22, 30, 32]). The manipulation has successfully diversified the data if the end result is a diversity of output predictions [14].

Classifier diversity (or “structural diversity” [24] or “heterogeneous ensembling” [18]) has a similar goal of diversifying the predictions that the base-classifiers output. Classifier diversity is achieved by using different classifier algorithms that learn from the data in different ways. By using a variety of classifiers, each with their own advantages and disadvantages, the outputs of the classifiers are diverse [4].

As an example, the Random Forest algorithm [3] uses bagging [2] and random feature subspaces [10, 29] to achieve data diversity. However, it only builds an ensemble of decision trees, and thus does not target classifier diversity. In this paper, we use both types of diversity in our proposed approach. These are explored below in Sect. 3.

3 Proposed Approach

We first provide an overview of the proposed approach in Sect. 3.1. We then investigate each novel component of the approach one-by-one in the subsequent subsections.

3.1 Overview

The approach can be summarized in the following steps:

  • Step 1: Calculate the largest number of clusters K we can partition the training data X into without reducing the average number of records in each cluster below the square of the number of clusters.

  • Step 2a: For \(k=1,\ldots ,K\), partition the data X using k-Means clustering.

  • Step 2b: For each new cluster created, compare its similarity (in terms of the records it contains) to all previously created clusters, using the Jaccard Index [11]. Remove a new cluster if it is very similar to a previous cluster.

  • Step 3a: For each remaining cluster, check whether all the records in the cluster have the same class label y (i.e., the cluster is homogeneous). If so, skip Step 3b; any future record that is filtered to this cluster will be predicted to have the label shared by all of its training records. In effect, the cluster will output v votes for the label y, rather than training v base-classifiers on the homogeneous data.

  • Step 3b: For each cluster not addressed by 3a, build v base-classifiers using the data in the cluster. Examples of base-classifiers include a decision tree [21], a support vector machine [28], a naive Bayes model [20], a discriminant analysis model [19], a k-nearest neighbors model [31] and a randomly under-sampled boosted model (RUSBoost) [27]. Finding the optimal set of base-classifiers is part of future work.

  • Step 4: Predict the label of new records z by filtering them into the K closest clusters, using the base-classifiers built from those clusters to each vote on a label, and then using the majority vote as the final prediction.

The specifics of each of these steps are explored in the following subsections.

3.2 Achieving Data Diversity

To produce a diverse set of base-classifiers, we diversify the training data. Previous research supports using bagging to achieve this [5, 16, 26]; however, we argue that the benefits of bagging lie in reducing the variance of the models [2], not in promoting diversity. Because bagging preserves the distribution of the underlying data ever more faithfully as the sample size increases, it does not provide a diverse range of distributions to the base-classifiers. This problem can be avoided by clustering the data instead, finding regions of data with many attributes in common, and few attributes in common with other regions of data. We achieve this using k-Means clustering [12].

We could find a single optimal value for k and limit our ensemble of base-classifiers to a single clustering. Instead, we propose using a range of values for k and building a much larger ensemble. By using values of k ranging from 1 to some upper bound K, we increase diversity further by finding clusters of different sizes (and different distributions) in the training data.
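To make this step concrete, the following sketch runs k-Means once for every \(k=1,\ldots ,K\). Note that our experiments use Matlab's k-Means implementation [17]; the scikit-learn calls, function names, and synthetic-data usage example below are illustrative assumptions only, not the code we ran.

```python
# Illustrative sketch only: the paper uses Matlab's k-Means; scikit-learn's
# KMeans is assumed here as a stand-in to show the k = 1..K clustering loop.
import numpy as np
from sklearn.cluster import KMeans

def cluster_for_all_k(X, K, random_state=0):
    """Return one k-Means run per k in 1..K as (k, labels, centroids) tuples."""
    clusterings = []
    for k in range(1, K + 1):
        km = KMeans(n_clusters=k, n_init=10, max_iter=500,  # 500 iterations, as in Sect. 4
                    random_state=random_state)
        labels = km.fit_predict(X)          # cluster index for every record
        clusterings.append((k, labels, km.cluster_centers_))
    return clusterings

# Example usage on synthetic data:
# X = np.random.rand(1000, 8)
# K = 10                                    # K = floor(n^(1/3)), see Sect. 3.3
# runs = cluster_for_all_k(X, K)
```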

3.3 Choosing K

We need an appropriate K value that gives us a large set of clusters to build many diverse classifiers from, but also provides enough data in each cluster to meaningfully train the base-classifiers with. We balance these two goals with the following heuristic:

  1. Let \(n_k\) equal the average number of records in each cluster created by k-Means clustering. For a number of clusters k, \(n_k=n/k\), where n is the total number of records in the dataset.

  2. We limit the maximum value of k such that \(n_k \ge k^2\). That is, each cluster contains, on average, at least as many records as the square of the number of clusters.

  3. Thus we have \(n/k \ge k^2\). Re-arranging this inequality gives \(n \ge k^3\).

  4. The maximum number of clusters K is therefore:

     $$ K = \left\lfloor \root 3 \of {n} \right\rfloor . $$

  5. We define the minimum k value for k-Means clustering as \(k=1\).

  6. Our proposed ensemble classifier therefore executes k-Means clustering K times, for \(k=1,\ldots ,K\).

This balances the number of clusters with the size of each cluster. It is based on a similar concept used to bound k-Means clustering when using it on its own [12, 23].
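As a minimal illustration of this heuristic, the sketch below computes K from n; the function name and the integer-rounding guards are our own. For example, \(n=1000\) records gives \(K=10\), and at \(k=10\) each cluster still holds 100 records on average, satisfying \(n_k \ge k^2\).

```python
def choose_K(n):
    """Largest K with n >= K**3, i.e. K = floor of the cube root of n."""
    K = round(n ** (1.0 / 3.0))
    while K ** 3 > n:          # guard against floating-point rounding up
        K -= 1
    while (K + 1) ** 3 <= n:   # guard against floating-point rounding down
        K += 1
    return K

# choose_K(1000) -> 10;  choose_K(26) -> 2
```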

3.4 Pruning Repeated Clusters

In Sects. 3.2 and 3.3, we described how the proposed algorithm creates an increasing number of clusters, from 1 to K, using k-Means clustering. This results in a total of \(\sum _{i=1}^K i\) clusters, or in other words, \(K(K+1)/2\) clusters. This provides a large set of clusters from which to train base-classifiers, which will in turn be used to vote on predicted class labels.

However, there is a risk in using all of these clusters to train classifiers. If two clusters, made during different clusterings (i.e., when \(k=i\) and then when \(k=j\) such that \(i\ne j\)), contain exactly the same records, then the classifiers built from those two clusters will be very similar, perhaps even identical (since there is zero data diversity). Not only does this waste computation time, it also means that when voting on the final predicted label, the votes from these classifiers are doubled up (i.e., they get two votes each), biasing the ensemble towards their output.

Table 1. Average difference in classification error compared to when \(\theta =0.9\), across the eight datasets presented in Table 2.
Table 2. Details of the eight datasets we use in our experiments, taken from the UCI Machine Learning Repository [15].

We remove this bias using the following process: as we grow k towards K, each cluster we create is compared to all previous clusters to check if it is sufficiently diverse. For each new cluster c created in clustering k, we compare the records in c (\(X_c\)) to the records of each cluster u in the set of accepted clusters U using the Jaccard Index [11]:

$$ J(c, u) = \frac{|X_c \cap X_u|}{|X_c \cup X_u|} \; ; \; \forall u\in U . $$

Computationally, we can calculate J(c, u) using the indexes of the records in X, rather than comparing the contents of each \(x\in X\). If there are no records in common, \(J(c, u)=0\); and \(J(c, u)=1\) if both clusters contain precisely the same set of records. If, for some u, \(J(c, u)>\theta \), c is not added to U. Here, \(\theta \) represents a similarity threshold, which we empirically demonstrate is ideally set at \(\theta =0.9\) in Table 1. Table 1 presents the average difference in classification error, across the eight datasets presented in Table 2, when \(\theta =0.5,0.6,0.7,0.8,1.0\) (and when there is no pruning) compared to when \(\theta =0.9\).
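The pruning rule can be sketched as follows, representing each cluster by the set of its record indexes; the helper names are hypothetical, and the default \(\theta =0.9\) follows the recommendation above.

```python
# Illustrative sketch of the cluster-pruning rule, using Python sets of record
# indexes; the threshold value theta = 0.9 follows the recommendation in Table 1.
def jaccard(cluster_a, cluster_b):
    """Jaccard Index of two clusters given as sets of record indexes."""
    return len(cluster_a & cluster_b) / len(cluster_a | cluster_b)

def prune_clusters(candidate_clusters, theta=0.9):
    """Keep a cluster only if it is not too similar to any accepted cluster."""
    accepted = []
    for c in candidate_clusters:       # clusters in the order they are created
        if all(jaccard(c, u) <= theta for u in accepted):
            accepted.append(c)
    return accepted

# Example: the second cluster repeats the first exactly and is pruned.
# prune_clusters([{0, 1, 2, 3}, {0, 1, 2, 3}, {4, 5, 6}])
#   -> [{0, 1, 2, 3}, {4, 5, 6}]
```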

3.5 Achieving Classifier Diversity

We then build a collection of base-classifiers from the data in each non-pruned cluster. For the experiments in this paper, we use the following six classifiers: a decision tree, a support vector machine, a naive Bayes model, a discriminant analysis model, a k-nearest neighbors model and a randomly under-sampled boosted model (RUSBoost). This collection of classifiers is independent of the proposed ensemble framework, and future work will investigate the optimal number and types of classifiers to use.
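As an illustrative analogue of this classifier pool (our experiments use Matlab's implementations [17]), the sketch below trains one model of each type on a single cluster's data using scikit-learn; AdaBoost is used as a stand-in for RUSBoost, which is not part of scikit-learn itself, and the function name is our own.

```python
# Illustrative Python analogue of the base-classifier pool; the paper uses
# Matlab's implementations [17]. AdaBoost stands in for RUSBoost here.
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier

def build_base_classifiers(X_cluster, y_cluster):
    """Train one classifier of each type (v = 6) on a single cluster's data."""
    models = [
        DecisionTreeClassifier(),
        SVC(),
        GaussianNB(),
        LinearDiscriminantAnalysis(),
        KNeighborsClassifier(),
        AdaBoostClassifier(),          # stand-in for RUSBoost [27]
    ]
    return [m.fit(X_cluster, y_cluster) for m in models]
```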

This diverse collection of classifiers enables the ensemble to discover correlations and patterns in the data that would not be discovered if we limited ourselves to a single classifier type, as Random Forest does [3]. By discovering different patterns with different classifiers, we diversify the errors made by each base-classifier, which in turn reduces the final classification error (as discussed in Sects. 1 and 2). We can see in Table 5 (presented later in Sect. 4) that we outperform Random Forest on almost all datasets.

3.6 Classifying New Records

Once the ensemble has been built and trained (Steps 1–3 in Sect. 3.1), our model is ready to predict the label of unseen records. When inputting an unseen record z into our ensemble classifier, we propose finding the K clusters in U whose centroids are closest to z, and using the base-classifiers made from those K clusters to predict the label of z.

This approach means that clusters made from different clusterings (when k had different values) are not treated differently from clusters made in the same clustering; if the centroids of two clusters from the same clustering are closer to z than the centroids of two clusters from different clusterings, the closer clusters are used. We also do not want to use the base-classifiers made from every cluster to classify z. Many of the base-classifiers were trained on data whose distribution is very different from that of z, and whose clusters did not contain any records resembling z. Using those classifiers to predict z therefore makes little sense.
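A minimal sketch of this prediction step is given below; the Cluster container, function names, and the use of Euclidean distance to centroids are our own assumptions. Homogeneous clusters contribute v identical votes, as described in Step 3a.

```python
# Illustrative sketch of Step 4: route a new record z to the K clusters whose
# centroids are nearest, collect one vote per base-classifier, and return the
# majority label. The 'Cluster' container is a hypothetical structure.
from collections import Counter
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Cluster:
    centroid: np.ndarray
    models: list = field(default_factory=list)  # trained base-classifiers
    homogeneous_label: object = None            # set when Step 3a applied

def predict_label(z, accepted_clusters, K, v=6):
    """Majority vote over the base-classifiers of the K nearest clusters."""
    nearest = sorted(accepted_clusters,
                     key=lambda c: np.linalg.norm(z - c.centroid))[:K]
    votes = []
    for c in nearest:
        if c.homogeneous_label is not None:
            votes.extend([c.homogeneous_label] * v)   # Step 3a: v identical votes
        else:
            votes.extend(m.predict(z.reshape(1, -1))[0] for m in c.models)
    return Counter(votes).most_common(1)[0][0]
```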

4 Experiments and Results

Here we present experiments that cover both the individual components of the proposed algorithm and its overall performance compared to the current state-of-the-art. All experiments are performed using stratified five-fold cross-validation, repeated ten times and aggregated. We perform our experiments on eight datasets from the UCI Machine Learning Repository [15]. The details of the datasets are presented in Table 2. We use Matlab's implementations of k-Means and the base-classifiers [17]. We use the default settings in all cases, except for the following (a sketch of the evaluation protocol is given after this list):

  • The maximum number of iterations for k-Means is increased from 100 to 500, to ensure that centroid convergence occurs.

  • Regularization is turned off for discriminant analysis models, to prevent the software from throwing an error when a feature with zero variance is input.

  • Kernel-smoothing density estimation is used when building naive Bayes models, instead of Gaussian distributions, again to prevent an error when a feature with zero variance is input.
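For reference, the following sketch outlines the evaluation protocol (stratified five-fold cross-validation, repeated ten times and aggregated). Our experiments use Matlab [17], so the scikit-learn calls and function names below are stand-ins, not the code we ran.

```python
# Illustrative sketch of the evaluation protocol only; X and y are assumed to
# be NumPy arrays, and train_and_predict is any routine that trains on the
# training fold and returns predicted labels for the test fold.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def cross_validated_error(X, y, train_and_predict):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
    errors = []
    for train_idx, test_idx in cv.split(X, y):
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        errors.append(np.mean(y_pred != y[test_idx]))
    return np.mean(errors)   # aggregated classification error
```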

4.1 Assessing Cluster Size and Pruning

The first step in our proposed algorithm is to define K. We argue that \(\root 3 \of {n}\) is an appropriate value of K, and this is supported empirically by the results seen in Table 3. In Table 3, we compare \(K=\root 3 \of {n}\) to one smaller value (\(\root 4 \of {n}\)) and one larger value (\(\sqrt{n}\)) of K. Classification error is lowest when \(K=\root 3 \of {n}\) for six of the eight datasets, and very close to lowest for the remaining two (within one standard deviation).

Table 3. The classification error for three different values of K.
Table 4. The change in classification error with and without cluster pruning.

As part of Step 2, we propose removing repeated clusters using the Jaccard Index. Based on the empirical results in Table 1, we recommend setting the similarity threshold to \(\theta =0.9\). This removal of repeated clusters represents a large saving in computation time, preventing v (in our experiments, \(v=6\)) redundant classifiers from being trained per removed cluster. It also removes a large number of repeated votes. The resulting reduction in classification error can be seen in Table 4. For seven of the eight datasets, the classification error after pruning repeated clusters is lower than or equal to the error without pruning. Across all datasets, the average reduction in error is 1.1 percentage points.

4.2 Comparison with Other Ensemble Algorithms

Table 5 presents the classification error our proposed algorithm achieves on eight datasets, compared to the classification error of three other algorithms. One of these algorithms, Random Forest, is included as a benchmark ensemble algorithm due to its reputation as a consistently high-performing algorithm [7]. The other two algorithms represent the current state-of-the-art in ensemble classification, with the results presented being those the authors reported in their respective papers: Kuncheva and Rodriguez [13]; and Zhang and Suganthan [33]. In both cases, we present the results for the highest-performing version of their proposed algorithms: the naive Bayes (NB) version of Kuncheva and Rodriguez's algorithm [13], and the version of Zhang and Suganthan's algorithm that uses oblique rotation forests with axis-parallel splits (MPRRoF-P) [33].

Table 5. The classification error results for four ensemble algorithms, including our proposed approach.

Out of the eight datasets, the approach proposed in this paper has the lowest classification error in five cases. It has the second-lowest in two cases, and the third-lowest for one dataset (Segmentation). Interestingly, as we saw in Table 4, the Segmentation dataset is also the only dataset for which our proposed cluster pruning does not perform well. This explains the sub-par performance compared to the state-of-the-art for this dataset.

5 Conclusion

Diversity is crucial for building a high-performing ensemble classifier. By performing K clusterings of different sizes, and using these clusters as training data for a diverse set of base-classifiers, the error of the ensemble classifier is reduced. Not only that, but by first pruning repeated clusters, biased votes can be removed from the majority voting process, further reducing classification error. On average, pruning repeated clusters reduces classification error by 1.1 percentage points. Looking forward, we plan to investigate what factors affect classifier diversity, and how classifier diversity impacts the performance of ensemble classification.