
1 Introduction

Ensemble learning has shown many theoretical and practical benefits compared to the use of a single best model [13, 18]. As opposed to using a single predictor, ensemble methods gain statistical benefits from combining the outputs of several predictors. They provide a divide-and-conquer strategy that a single predictor cannot achieve when the problem is too difficult, and they provide a more accurate representation of the data when the data is generated from different sources (data fusion).

An early example of the use of ensemble methods in the literature is presented in [7], where the feature space is partitioned using two or more classifiers. In the nineties, two of the most widely used ensemble methods were proposed: Boosting [19] and Bagging [3]. Schapire introduced the Boosting algorithm in [19], showing that a strong learner can be built by combining a number of weak learners. The introduction of Boosting led to the development of AdaBoost and its many variations for solving multi-class and regression problems. Meanwhile, Breiman introduced Bagging in [3], where the base predictors are trained on bootstrap replicas of the training data. In addition to these two algorithms, many well-performing ensemble methods have been developed and used in a wide range of applications, such as stacked generalization [20], mixture of experts [10] and negative correlation learning [8], among others.

In the literature, it has been shown that there are two conditions for an ensemble to perform better than a single predictor: the base predictors should be diverse (their error correlation is reduced) and they should have a reasonable level of performance [18]. In ensemble learning, diversity has been acknowledged as an important characteristic [6]. An ensemble with diverse models can have better performance due to the complementary behaviour of its components [21]; however, as shown in [17], the diversity measure has to be chosen carefully so that it works with the combiner used.

The work presented in this paper builds on our broader investigations of multilevel structures of classifiers and predictors [11, 14, 16, 18] and directly follows from our previous work presented in [1]. It discusses diversity as a characteristic of a Multi-Component, Multi-Layer Predictive System (MCMLPS), and investigates its effect on the accuracy of prediction.

This paper is organized as follows. Section 2 introduces a new type of ensemble system. Section 3 explores the methodology and the design cycle of the proposed locally trained MCMLPS. Section 4 discusses the experimental work and the obtained results: it compares the testing accuracy of the system with four benchmark algorithms and studies the relation between the overall accuracy of the ensemble and the amount of disagreement among the base predictors. Section 5 explores a number of variations in the parameters of the proposed system. Finally, Sect. 6 draws the main conclusions of the paper.

2 Multi-Component, Multi-Layer Predictive Systems

The MCMLPS used in this study was introduced in [1] and is shown in Fig. 1, where \(w_{11},...,w_{nk}\) are the weights of the first layer, n represents the number of base ensembles and k represents the number of models inside each base ensemble. Furthermore, \(w_{1},...,w_{n}\) are the weights of the second layer for the n base ensembles, \(M_{1},...,M_{k}\) are the base predictors of the first layer ensembles, \(g_{1},...,g_{n}\) are the ensembles created from combining the base predictors, h(x) is the second layer combiner and \(\hat{Y}\) is the final prediction of the system. Let X be the data set containing the training objects, C represent the number of classes, \(\theta _{c}\) represent the actual class and \(M_{k}^{n}\) represent the output prediction of the model (shown in the first layer of the ensemble in Fig. 1), where \(M_{k}^{n}=1\) for class \(\theta _{c}\) and 0 otherwise, and \(c=1,..,C\). The outputs of the base predictors \(M_{k}^{n}\) and the ensembles \(g_{n}\) are given as C-dimensional vectors, where \([M_{1}^{j},..,M_{k}^{j}]^T \in \{0,1\}^C\) and \([g_{1},.., g_{j}]^T \in [0,1]^C\), \(j=1,...,n\), respectively. Equations 1 and 2 show the mathematical representation for the ensembles generated from the first layer:

$$\begin{aligned} g_{j}(x)=\Sigma _{i=1}^{k} w_{1j} M_{i}^{(j)}(x) \end{aligned}$$
(1)

and let

$$\begin{aligned} d_{j,c}(x)={\left\{ \begin{array}{ll} 1 &{} if\ g_{j}(x)\ =\theta _{c},\\ 0 &{} otherwise.\end{array}\right. } \end{aligned}$$
(2)

Then the second layer ensemble is given by:

$$\begin{aligned} h(x)= \Sigma _{j=1}^{n} w_{2j} d_{j,c} \end{aligned}$$
(3)

and the final prediction of the system is:

$$\begin{aligned} \hat{Y}= \arg \max _{c} h(x). \end{aligned}$$
(4)
Fig. 1. The multi-component multi-layer predictive system.
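The combination rule in Eqs. 1 to 4 can be illustrated for a single instance with a short Python sketch. This is not the authors' implementation: the function name `two_layer_vote`, the array layout and the use of one first-layer weight per model are assumptions made for illustration only.

```python
import numpy as np

def two_layer_vote(model_preds, w1, w2, n_classes):
    """Two-layer weighted majority vote for one instance (illustrative sketch).

    model_preds: int array (n, k), class label predicted by each of the k
                 models in each of the n first-layer ensembles.
    w1:          float array (n, k), assumed first-layer weights.
    w2:          float array (n,), assumed second-layer weights.
    """
    # One-hot encode every model output: shape (n, k, C).
    one_hot = np.eye(n_classes)[model_preds]
    # Eq. 1: weighted sum of the model outputs inside each ensemble -> (n, C).
    g = (w1[:, :, None] * one_hot).sum(axis=1)
    # Eq. 2: d[j, c] = 1 if ensemble j votes for class c, 0 otherwise.
    d = np.eye(n_classes)[g.argmax(axis=1)]
    # Eq. 3: weighted sum of the ensemble votes -> (C,).
    h = (w2[:, None] * d).sum(axis=0)
    # Eq. 4: the final prediction is the class with the largest support.
    return int(h.argmax()), g, h
```

With equal first-layer weights the inner step reduces to a plain majority vote inside each ensemble, so only the second-layer weights decide between ensembles that disagree.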

3 Designing MCMLPS: Methodology

Despite the similarities between the MCMLPS presented in this paper and the one presented in [1], there is a key difference between the two systems. The approach introduced in [1] is an unsupervised learning approach, in which the base predictors are trained on disjoint sets of the data for which only subsets of the features are selected. In contrast, the MCMLPS presented in this paper is a supervised learning approach in which the base predictors are trained on subsets of the features for all of the training data. Moreover, it uses a different similarity metric. The methodology used to build the MCMLPS encompasses the following phases: (a) data preparation and partitioning, and (b) model generation and combination.

In order to validate and examine the generalization ability of the proposed architecture, Density Preserving Sampling (DPS) [5] is used to partition the data. DPS divides the data into subsets that are representative of the whole data set [5]. In this work, DPS is first used to split the data into training and testing sets. The training data is then assigned to a set of local regions (LRs) according to its feature similarity. The similarity is determined using a mutual information based approach (discussed in Subsect. 3.1). DPS is then used again to split the LRs data into K folds, and K models are trained on the data of the generated folds. The general design phases for the MCMLPS are discussed below:

  • Data preparation and partitioning:

    The data goes through three partitioning stages: first the whole data set is split into training and testing sets, then the training set is allocated to the LRs, and finally within the LRs the data is split into K subsets which are used to train the local models. Figure 2 shows the preparation and partitioning of the data, where \(F_{1},...,F_{4}\) are the folds generated from the first DPS split, \(LR_{1},...,LR_{N}\) are the LRs and \(M_{1},...,M_{k}\) are the local models within the regions, trained using data from the second DPS split.

Fig. 2. Data preparation and model generation.

The points given below summarise the procedure used in this phase:

  • Apply DPS to split the data into 4 representative folds.

  • Use 3 out of 4 folds for training and the last fold for testing.

  • Find the similarity matrix for the training data using the mutual information of the features.

  • Choose N rows from the similarity matrix to be the seeds for the LRs.

  • Add the training data to the LRs according to the similarity of data features to the LRs seeds.

  • Apply K-fold DPS to the LRs data.

  • Model generation, testing and combining:

    Once the data is assigned to the relevant LRs, the second DPS is applied to generate the K folds within the LRs, and K models are trained on the LR folds. Furthermore, for every new instance, N weight values are computed with respect to the N LRs. This phase can be summarized as follows:

  • Train a predictive model on each of the K LRs folds.

  • Compute the weights of the LRs votes using the similarity between the LRs seeds and the testing data.

In the first layer, N ensembles are generated by combining the models of the N LRs. In the second layer, a single ensemble that combines the N first-layer ensembles is generated. The combining method used is a weighted majority vote, with the similarity of the LRs features used as the weights in both layers. The procedure is repeated for all four folds \(F_{1},...,F_{4}\), so that each time a different fold is used for testing.
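The three partitioning stages described above can be sketched as follows. DPS is not reproduced here: a simple class-balanced random fold assignment is swapped in as a stand-in, and the function names `stratified_folds` and `partition`, as well as the argument `lr_features` (one list of feature indices per LR), are hypothetical names introduced for illustration.

```python
import numpy as np

def stratified_folds(y, n_folds, seed=0):
    """Stand-in for DPS (assumption): class-balanced random fold ids."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold

def partition(X, y, lr_features, k_folds, test_fold, seed=0):
    """Return per-LR training folds plus the train and test row indices."""
    # Stage 1: four outer folds, one of which is held out for testing.
    outer = stratified_folds(y, 4, seed)
    train = np.flatnonzero(outer != test_fold)
    test = np.flatnonzero(outer == test_fold)
    # Stage 3 fold ids: the training rows are split into K folds.
    inner = stratified_folds(y[train], k_folds, seed + 1)
    regions = []
    for feats in lr_features:
        # Stage 2: each LR sees all training rows but only its own features.
        folds = []
        for f in range(k_folds):
            rows = train[inner == f]
            folds.append((X[np.ix_(rows, feats)], y[rows]))
        regions.append(folds)
    return regions, train, test
```

Calling `partition` once per value of `test_fold` in 0 to 3 reproduces the rotation over the four outer folds; one local model would then be trained on each `(X_fold, y_fold)` pair.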

3.1 Conditional Mutual Information Based LRs

This approach aims to split the feature space into a number of subsets based on their Conditional Mutual Information (CMI). The features with the highest CMI values are chosen to be the seeds for the LRs. The CMI is measured using the following equation [4]:

$$\begin{aligned} J_{cmi}(X_{k})=I(X_{k};Y)-I(X_{k};S)+I(X_{k},S|Y) \end{aligned}$$
(5)

where \(X_{k}\) is a single feature, Y is the output and S is the set of remaining features (all the features apart from \(X_{k}\)). \(I(X_{k};Y)\) is the mutual information between the feature \(X_{k}\) and the class Y, \(I(X_{k};S)\) is the redundancy of feature \(X_{k}\) with respect to the remaining features and \(I(X_{k},S|Y)\) is the conditional redundancy (the class dependency of \(X_{k}\) with the existing feature set S). According to [4], the equation given above shows that including correlated features can be useful if the correlation of the features with the class is higher than their inner correlation. The benefits of including correlated features were explored earlier in [9], where it was observed that “correlation does not imply redundancy”.
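For discrete features, the three terms of Eq. 5 can be estimated from joint entropies. The sketch below uses plug-in (count-based) estimates and treats S as the joint variable formed by all remaining features; the names `entropy` and `j_cmi` are illustrative, and the paper does not state which estimator was actually used. Joint estimates over many features need large samples to be reliable, so this is only suitable for small, discretised problems.

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Plug-in joint entropy (in bits) of one or more discrete columns."""
    counts = np.array(list(Counter(zip(*cols)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def j_cmi(X, y, k):
    """Eq. 5 for feature k of a discrete data matrix X with labels y."""
    xk = X[:, k]
    S = [X[:, j] for j in range(X.shape[1]) if j != k]
    # Relevance I(Xk;Y) = H(Xk) + H(Y) - H(Xk,Y).
    relevance = entropy(xk) + entropy(y) - entropy(xk, y)
    # Redundancy I(Xk;S) = H(Xk) + H(S) - H(Xk,S).
    redundancy = entropy(xk) + entropy(*S) - entropy(xk, *S)
    # Conditional redundancy I(Xk;S|Y) = H(Xk,Y) + H(S,Y) - H(Xk,S,Y) - H(Y).
    conditional = (entropy(xk, y) + entropy(*S, y)
                   - entropy(xk, *S, y) - entropy(y))
    return relevance - redundancy + conditional
```

An XOR problem shows why the conditional term matters: each input alone carries no information about the class, yet the score is positive because the two inputs become dependent once the class is known.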

Once the CMI values of the features are computed using Eq. 5, the N highest-scoring features are selected to be the seeds for the LRs. In order to add new features to the LRs, the similarity of the features to the LRs seeds needs to be calculated. Equation 6 is used to determine the similarity between the features and the LRs seeds.

$$\begin{aligned} J_{cmi+}(X_{k})=I(X_{k};Y)+I(X_{k}; J_{cmi}(X_{k}))+I(X_{k},J_{cmi}(X_{k})|Y) \end{aligned}$$
(6)

In this equation, the pairwise mutual information of the features with the LR seeds is calculated, and the features that have the highest CMI with respect to the seeds are added to the LRs. By adding rather than subtracting the redundancy term \(I(X_{k}; J_{cmi}(X_{k}))\), this approach aims to group similar features together in the LRs. Each LR is assigned a subset of the features: all the features are ranked according to their mutual information with the seed of the LR, and only the highest-ranking features are assigned to the LR. The ratio of the features assigned to the LRs is \(\alpha \), where \(0<\alpha <1\).

In order to use this approach to build an MCMLPS, the data is first split using the method presented in Fig. 2, with DPS used to split the data into training and testing sets. Then the following steps are taken to split the training data into the N LRs:

  1. Calculate the CMI among the training data features using Eq. 5.

  2. Choose the highest scoring N features to be the seeds of the LRs.

  3. For the remaining features, use Eq. 6 to rank the features according to their similarity to the LRs seeds.

  4. Based on the features mutual information with the seeds, assign \(\alpha \) of the total number of features to the LRs.
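The four steps above can be sketched in a few lines of Python, assuming the CMI scores of Eq. 5 and a feature-to-seed similarity matrix (for example from Eq. 6) have already been computed. The function name `build_local_regions`, the matrix layout `similarity[f, s]` and the choice to count the seed itself towards the \(\alpha \) share are assumptions for illustration, not details given in the paper.

```python
import numpy as np

def build_local_regions(cmi_scores, similarity, n_regions, alpha):
    """Assign feature indices to LRs (illustrative sketch).

    cmi_scores: array (d,), one score per feature (step 1).
    similarity: array (d, d), similarity[f, s] of feature f to seed s (step 3).
    alpha:      fraction of the d features given to each LR, 0 < alpha < 1.
    """
    d = len(cmi_scores)
    # Step 2: the N highest scoring features become the LR seeds.
    seeds = np.argsort(cmi_scores)[::-1][:n_regions]
    # Assumed region size: alpha * d features, seed included.
    size = max(1, int(round(alpha * d)))
    candidates = [f for f in range(d) if f not in seeds]
    regions = []
    for seed in seeds:
        # Steps 3 and 4: rank the remaining features by similarity to this
        # seed and keep only the highest ranking ones.
        ranked = sorted(candidates, key=lambda f: similarity[f, seed],
                        reverse=True)
        regions.append([int(seed)] + ranked[: size - 1])
    return regions
```

Because every LR ranks the same candidate pool independently, a feature may appear in more than one region under this sketch; whether the original system allows such overlap is not stated in the text.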

In both layers, a weighted majority vote is used to combine the respective predictions, where the mutual information of the LRs features is used as the weighting vector. The weights of the predictions of the LRs models are calculated as the sum of the mutual information values of the LR features.
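A minimal sketch of this weighting, assuming one mutual information value per feature is available (the text does not specify against which variable it is measured) and that the weights are normalised to sum to one, which is a convenience added here rather than something stated in the paper. The name `region_weights` is hypothetical.

```python
import numpy as np

def region_weights(regions, feature_mi, normalise=True):
    """Weight of each LR = sum of the MI values of its features (sketch)."""
    w = np.array([sum(feature_mi[f] for f in feats) for feats in regions],
                 dtype=float)
    # Normalisation is an assumption; it does not change the arg max of a
    # weighted vote, only the scale of the weights.
    return w / w.sum() if normalise else w
```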

4 Results

The MI based MCMLPS introduced in this paper is applied to the data sets shown in Table 1. The data sets are taken from the UCI machine learning archive [12]. The performance of this system is compared to the correlation based MCMLPS [1], Rotation Forest (RF) [15], Bagging [3] and AdaBoost [19]. The settings for the proposed system and the benchmark algorithms are as follows:

  • MI based MCMLPS: 6 LRs are used, each having 8 models (48 Decision Trees (DTs) in total) trained on a fraction \(\alpha \) of the features.

  • Correlation based MCMLPS: 6 LRs are used, each having 8 models trained on disjoint subsets of the data. The number of features used in the LRs is determined through a separate optimization routine [1].

  • RF: the number of classifiers is 6 and the number of disjoint feature subspaces is 6.

  • AdaBoost and Bagging: 48 DTs are used as the weak learners for both algorithms.

In order to compare the results obtained from this system with the correlation based MCMLPS, both the number of LRs and the number of models inside the LRs are set to the same values (6 LRs with 8 models inside each LR). Furthermore, the \(\alpha \) value (the ratio of the features assigned to the LRs) is set to \(30\%\) of the features. The base predictors used are CART DTs and feedforward Neural Networks (NNs). The following subsections discuss the internal accuracies of the LRs base predictors and compare the overall system performance with the benchmark algorithms. This is followed by a subsection that investigates the level of disagreement among the LRs predictions of the proposed MI based MCMLPS and compares its overall performance with the benchmark algorithms.

Table 1. Data sets used in the experiments.

4.1 Internal Accuracy and Benchmark Comparison

In this section the internal accuracies of the LRs base predictors (CART DTs) are measured and compared across the four DPS folds. An example of the internal accuracies of the LRs base predictors for the Gaussian 8 dimensional data set is shown in Fig. 3. Figure 3 shows that no single LR outperforms the other LRs on all four folds. In the MI approach, even small data sets such as the Ionosphere data set have a lower variation in their internal accuracies compared to the results of the correlation based MCMLPS [1]. A possible explanation is that the LRs in this case are trained on a subset of the features for the whole data set rather than on disjoint subsets of the data. The overall testing accuracy of the MI based MCMLPS, averaged over the four DPS iterations, is shown in Table 2. In addition, the table shows the test accuracies of the four benchmark algorithms (correlation based MCMLPS, RF, Bagging and AdaBoost). The results show that this approach for generating the LRs has generally improved on the testing accuracy obtained from the correlation based MCMLPS.

Fig. 3. Training accuracies of the local regions models for the Gaussian 8D data set when CART DTs are used as the base predictors.

Furthermore, it can be seen in Table 2 that Bagging has the highest test accuracy on all the data sets except for the waveform data set, where RF has the highest accuracy. Nevertheless, the proposed MCMLPS has an accuracy comparable to that of Bagging, with the accuracy difference ranging from zero for the WBC data set to 6.2 for the heart data set. Furthermore, Table 3 shows the test accuracy of the MI based MCMLPS compared to the correlation based MCMLPS and RF when the type of the base predictors is changed from CART DTs to feedforward NNs. For the RF algorithm, the testing accuracy increases on every data set when feedforward NNs are used as the base predictors. The MI based MCMLPS, on the other hand, shows a mixed response, with the accuracy increasing for only 4 out of 11 data sets.

Table 2. Benchmark comparison: Testing accuracy using CART DTs as the base predictors for both correlation based and MI based MCMLPS.

4.2 Disagreement Among the Base Predictors

The disagreements among the LRs votes and the final prediction, when CART DTs as well as feedforward NNs are used as the base predictors for the MI based MCMLPS, are shown in Fig. 4. The total disagreement values are found by measuring the disagreement between the final prediction of the system and the predictions of the individual LRs ensembles. Figure 4 shows that, in the proposed architecture, when CART DTs are used as the base predictors there are varied levels of disagreement within the LRs models and an even higher level of disagreement across the LRs. On the other hand, when feedforward NNs are used as the base predictors, similar models are generated in the individual LRs, yet there is still a high level of disagreement across the LRs. The high level of disagreement in the proposed architecture can be beneficial when it is applied to noisy data sets.
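One simple way to quantify this kind of disagreement is the fraction of test instances on which an LR ensemble's vote differs from the final prediction. The sketch below is an assumed reading of the measure, not the exact statistic plotted in Fig. 4, and the function name `lr_disagreement` is hypothetical.

```python
import numpy as np

def lr_disagreement(lr_preds, final_pred):
    """Per-LR disagreement rate with the final system prediction (sketch).

    lr_preds:   array (n_regions, n_samples) of LR ensemble predictions.
    final_pred: array (n_samples,) of final system predictions.
    """
    lr_preds = np.asarray(lr_preds)
    final_pred = np.asarray(final_pred)
    # Compare every LR row against the final prediction and average the
    # mismatches over the samples.
    return (lr_preds != final_pred[None, :]).mean(axis=1)
```

Averaging the returned vector gives a single disagreement figure for the whole system, which can then be set against the overall test accuracy.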

Table 3. Benchmark comparison: Testing accuracy using feedforward NNs as the base predictors for the MI based MCMLPS.
Fig. 4. Disagreements among the LRs of MI based MCMLPS when CART DTs and feedforward NNs are used as the base predictors.

5 Variation of the Conditional Mutual Information

This section investigates the effect of changing three aspects of the proposed MI based architecture. These are: modifying the equation used to find the LR seeds, partitioning the data using Cross Validation (CV) instead of DPS and changing the ratio of features allocated to the LRs. Table 4 compares the testing accuracy for the proposed architecture when the data is sampled using DPS as well as CV and when the conditional redundancy is included or excluded from the CMI equation.

Table 4. Testing accuracy of the MI based MCMLPS when the data is sampled using DPS or CV and when the conditional redundancy is included in or excluded from the CMI equation.

5.1 Ignoring the Conditional Redundancy with Respect to the Class

In this case the conditional mutual information term \(I(X_{k},S|Y)\) is removed from Eq. 5. This reduces the feature selection process to the mutual information feature selection criterion proposed by Battiti [2], given in Eq. 7:

$$\begin{aligned} J_{cmi}(X_{k})=I(X_{k};Y)-\beta I(X_{k};X_{j}) \end{aligned}$$
(7)

where \(\beta \) is a configurable parameter for which, according to Battiti [2], the optimal value is often 1. The aim of this section is to compare the case where correlated features are considered redundant and are removed from the feature selection process with the case where the conditional redundancy between the features is assessed with respect to the class. The results show that, apart from the German credit card data set, the cases where the conditional redundancy is considered in selecting the features have higher accuracies than the cases where the conditional redundancy is removed during feature selection.
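The criterion without the conditional term can be sketched from precomputed mutual information values. Battiti's criterion is commonly stated with the pairwise term summed over the features already in S; with a single feature \(X_{j}\) it reduces to the form in Eq. 7. The function name `mifs_score` and the input layout are assumptions for illustration.

```python
def mifs_score(relevance, pairwise_mi, k, selected, beta=1.0):
    """Relevance of feature k minus beta times its redundancy (sketch).

    relevance:   sequence, relevance[k] = I(X_k; Y).
    pairwise_mi: nested sequence, pairwise_mi[k][j] = I(X_k; X_j).
    selected:    indices j of the features X_k is compared against.
    """
    redundancy = sum(pairwise_mi[k][j] for j in selected)
    return relevance[k] - beta * redundancy
```

Setting `beta=0` ranks features by relevance alone, while larger values penalise correlated features more strongly, which is the behaviour the comparison in this subsection tests against Eq. 5.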

5.2 Using CV Instead of DPS

In this subsection, stratified CV is used to partition the data set into training and testing sets and then to partition the LRs data into K folds. Table 4 shows the testing accuracies averaged over the four iterations; it can be seen that using DPS to split the data produces higher accuracies than those obtained using stratified CV.

5.3 Changing the Ratio of Features Used in the LRs

In the previous experiments, the ratio of features used in the LRs of the MI based MCMLPS was set to \(30\%\). Higher and lower feature ratios have also been tested on the data sets used in these experiments. It has been found that lowering this ratio from \(30\%\) to \(10\%\) decreases the accuracy of the LRs predictions as well as the overall accuracy of the system. Meanwhile, increasing it to \(80\%\) results in a slight improvement in the prediction accuracy for some of the data sets, while the accuracy remains unchanged for the rest.

6 Conclusions and Future Work

This paper introduces a local learning based algorithm for MCMLPS. The architecture consists of multiple LRs. Each LR has multiple models trained on subsets of the features. These subsets of features are assigned to the LRs according to the similarity calculated using their conditional mutual information.

Investigating the internal performance of the proposed architecture showed that the overall testing accuracies of the architecture exceeded the average internal accuracies of its LRs models. The amount of variation in the internal accuracy depends mainly on the size and dimensionality of the data. The results showed that both the number of LRs and the number of models developed within the LRs need to be optimised with respect to the data set size and dimensionality.

This paper also explored changing three aspects of the proposed architecture. The first aspect is modifying the equation used to find the LR seeds, where removing the conditional redundancy term from the CMI equation resulted in a deterioration of the performance of the proposed architecture. This result supports the claim in [4] that including correlated features can be useful if their correlation with the class is higher than their inner correlation. The second aspect is partitioning the data using CV instead of DPS. Changing the sampling technique had a negative effect on the performance of the proposed architecture: in most cases the accuracy obtained with DPS is higher than that obtained with CV. Finally, increasing the ratio of the features used in the LRs may improve the accuracy of the MCMLPS for certain data sets.

The locality of the proposed architecture and the high level of disagreement among its base predictors can be beneficial in noisy environments. For example, when the noise is applied to only a part of the data, it will not have the same effect on all of the MCMLPS base predictors. The robustness of the proposed architecture to external noise will be investigated in future work.