1 Introduction

In the last two decades, multiclassifier (MC) systems, which combine the responses of a set of classifiers, have been intensively developed. The reason is that different classifiers offer complementary information about the object to be classified, and therefore an MC system can achieve better classification accuracy than any single classifier in the ensemble.

An MC system has three general phases [2]: (1) generation, in which the training set is used to generate a pool of classifiers; (2) selection, in which a single classifier (or an ensemble of classifiers) is selected to perform the classification; (3) combination (or integration), in which the final decision is made based on the predictions of the classifiers. It must be noted that the selection and integration phases may be optional, since two main approaches are used for classifier combination: classifier fusion and classifier selection [13]. In the first approach, all classifiers in the ensemble contribute to the decision of the MC system, e.g. through the sum rule or majority voting [11]. In the second approach, a single classifier is selected from the ensemble and its decision is treated as the decision of the MC system. The selection of classifiers can be either static or dynamic. In the static selection scheme, one classifier is selected for all test objects, whereas the dynamic classifier selection (DCS) approach explores the use of different classifiers for different test objects [6].

Recently, dynamic ensemble selection (DES) methods have been developed which first dynamically select an ensemble of classifiers from the entire set (pool) and then combine the selected classifiers by majority voting [3, 4, 12, 18]. In this way a DES-based system takes advantage of both the selection and fusion approaches. In most methods, the base classifiers are selected from the pool on the basis of an individual accuracy measure, called competence, evaluated in a local region of the feature space. These methods differ in the algorithms for determining classifier competence and in the ways of defining the local regions.

In [23] two methods were proposed where the local accuracy (competence) of a classifier is calculated as a simple percentage of correctly classified samples from the validation set. In the first method, called OLA (overall local accuracy), local accuracy is calculated in the region containing the K nearest validation objects of a test object, whereas in the LCA (local class accuracy) method, classifier competence is determined considering only those validation objects from the K-nearest-neighbors set which belong to the same class to which the unknown object is assigned. In [20,21,22] two methods using a probabilistic model were developed. The idea of the first method is based on relating the response of the classifier to the response obtained by random guessing. The measure of competence reflects this relation and rates the classifier with respect to random guessing in a continuous manner. In this way, it is possible to evaluate a group of classifiers against a common reference point. With such an approach, competent (incompetent) classifiers gain a meaningful interpretation: they are more (less) accurate than the random classifier. In the second method, first a randomized reference classifier (RRC) is constructed which, on average, acts like the classifier evaluated. Next, the competence of the evaluated classifier is calculated as the probability of correct classification of the respective RRC. Two interesting methods, called the A priori and A posteriori selection schemes, were presented in [9]. In the A priori method, a classifier is selected based on its accuracy within the local region, without considering the class assigned to the unknown pattern. In the A posteriori method, local accuracies are estimated using the class posterior probabilities and the distances of the samples in the defined local region. In [17] an interesting ranking-based approach to determining the competence measure was proposed, in which the ranking of base classifiers is obtained by estimating parameters related to the correctness of the classifiers in the pool. An interesting method called MCB (Multiple Classifier Behavior) was proposed in [10]. In this method the competence is defined as the classification accuracy calculated for a subset of the validation set which is generated as follows. First, the MCB is calculated for the test object and each of its K nearest validation objects as a vector whose elements are the class labels assigned by all classifiers in the ensemble. Next, the similarity between the MCBs is calculated using the averaged Hamming distance. Finally, the validation objects that are the most similar to the test object are used to generate the subset. The original KNORA-Eliminate (KE) method, belonging to the category of oracle-based methods, was proposed in [12]. The oracles are represented by the K nearest neighbors of the unknown pattern in the validation set, and the KE method selects only those classifiers which are able to recognize the entire K-neighborhood of the test pattern.

In this paper a new method for calculating classifier competence in the feature space is presented. In the proposed method, first the so-called decision profile of the classified object is determined using its K nearest validation objects. The decision profile estimates the chance that the recognized object belongs to each class. In the probabilistic model, the natural concept of the decision profile is based on the a posteriori probabilities of the classes at the point x. Next, the decision profile is compared with the response produced by the classifier (the support vector, i.e. the values of the discriminant functions) [7], and the competence is calculated according to the similarity rule: the closer the response is to the profile, the more competent the classifier is [14, 15]. Three different procedures for calculating the decision profile and three different measures for comparing the decision profile with the support vector are proposed in this study.

In a nutshell, the originality of the proposed approach lies in a different use of the validation set. In the state-of-the-art methods described above, the validation set is directly used for calculating the local accuracy of a classifier (i.e. its local competence) via ranking-based, accuracy-based, probabilistic, behavior-based or oracle-based measures. In the proposed method, however, the validation set is used for evaluating the decision profile of the test point, and the competence of a classifier is determined by the similarity of its response to this profile.

The paper is organized as follows. In Sect. 2 the measures of classifier competence are presented and two multiclassifier systems using the proposed measures of competence in a dynamic fashion are developed. The performance of the proposed MC systems was compared with that of seven multiple classifier systems using 15 datasets taken from the UCI Machine Learning Repository. The results of these computer experiments are described in Sect. 3, and Sect. 4 concludes the paper.

2 Multiclassifier System

2.1 Preliminaries

In the multiclassifier (MC) system we assume that a set of trained classifiers \(\varPsi =\{\psi _1, \psi _2, \ldots , \psi _L\}\) called base classifiers is given.

A classifier \(\psi _l\) is a function \(\psi _l:\;\mathcal {X}\,\rightarrow \,\mathcal {M}\) from a metric feature space \(\mathcal {X} \subseteq \mathbb {R}^{dim}\) to a set of class labels \(\mathcal {M}=\{1,2,\ldots ,M\}\). Classification is made according to the maximum rule

$$\begin{aligned} \psi _l(x)=i\Leftrightarrow d_{li}(x)=\max _{j\in \mathcal {M}}\ d_{lj}(x), \end{aligned}$$
(1)

where \([d_{l1}(x), d_{l2}(x), \ldots ,\) \( d_{lM}(x)]\) is a vector of class supports (classifying function) produced by \(\psi _l\). Without loss of generality we assume that \(d_{lj}(x)\ge 0\) and \(\sum _j d_{lj}(x)=1\).

In this paper, we propose MC systems which use a dynamic ensemble selection scheme and trainable combining methods based on a competence measure \(c(\psi _l|x)\) evaluating the competence of each base classifier \(\psi _l\) (\(l=1,2,\ldots ,L\)) at a point \(x \in \mathcal {X}\). The competence measure is normalized, i.e. \(0 \le c(\psi _l|x) \le 1\), where \(c(\psi _l|x) = 0\) (\(= 1\)) denotes the most incompetent (competent) classifier \(\psi _l\).

For training the methods that combine the base classifiers, it is assumed that a validation set

$$\begin{aligned} \mathcal {V}=\{(x_1,j_1), (x_2,j_2), \ldots ,(x_N,j_N)\}; \ \;x_k \in \mathcal {X},\ j_k \in \mathcal {M} \end{aligned}$$
(2)

containing pairs of feature vectors and their corresponding class labels is available.

2.2 Measure of Competence

K-neighborhood. Let us first introduce the concept of the K-neighborhood of an object \(x \in \mathcal {X}\), defined as the set of the K nearest neighbors of the point x from the validation set \(\mathcal {V}\), viz.

$$\begin{aligned} \mathcal {S}_K(x)=\{x_{n_1}, x_{n_2}, \ldots , x_{n_K} \in \mathcal {V}: \ \max _{k=1,2,\ldots ,K} || x_{n_k}-x || \le \min _{x_l \notin \mathcal {S}_K(x)} || x_l-x ||\}, \end{aligned}$$
(3)

where \(|| \cdot ||\) denotes the distance in the feature space \(\mathcal {X}\). The neighborhood size K is a parameter of the method – its value can be selected experimentally.
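As a concrete illustration, here is a minimal sketch in Python/NumPy of the neighborhood computation (3); the names `k_neighborhood` and `X_val` are illustrative, not from the paper:

```python
import numpy as np

def k_neighborhood(x, X_val, K):
    """Indices of the K nearest validation objects of x, cf. Eq. (3).

    x     : (dim,) feature vector of the test object
    X_val : (N, dim) matrix of validation feature vectors
    """
    dists = np.linalg.norm(X_val - x, axis=1)  # Euclidean distances ||x_k - x||
    return np.argsort(dists)[:K]               # indices of the K smallest distances
```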

Decision Profile. The decision profile of an object \(x \in \mathcal {X}\)

$$\begin{aligned} \delta (x)=[\delta _1(x),\delta _2(x), \ldots ,\delta _M(x)], \ \delta _j(x) \ge 0, \ \sum _j \delta _j(x)=1 \end{aligned}$$
(4)

denotes the vector of normalized values where the jth value \(\delta _j(x)\) is interpreted as a measure of the chance that the object x belongs to the jth class (\(j \in \mathcal {M}\)). In the probabilistic model, the natural value of \(\delta _j(x)\) is the a posteriori probability of the jth class at the point x.

We propose the following methods for calculating the decision profile at the point x using its K-neighborhood.

The Fraction-based Method (FM)

In this approach, \(\delta _j(x)\) is calculated as the fraction of objects from the jth class in the set \(\mathcal {S}_K(x)\). Let \(M_j^{(K)}(x)\) be the number of validation objects from \(\mathcal {S}_K(x)\) belonging to the jth class. Then

$$\begin{aligned} \delta _j(x) = \frac{M_j^{(K)}(x)}{K}, \ \ j \in \mathcal {M}. \end{aligned}$$
(5)
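A minimal sketch of (5), assuming class labels are coded 1, ..., M and `y_neigh` (an illustrative name) holds the labels of the K neighbors:

```python
import numpy as np

def decision_profile_fm(y_neigh, M):
    """Fraction-based decision profile, Eq. (5): delta_j = M_j^(K) / K."""
    y_neigh = np.asarray(y_neigh)
    counts = np.bincount(y_neigh, minlength=M + 1)[1:]  # per-class counts for labels 1..M
    return counts / len(y_neigh)
```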

The Ranking-based Method (RM)

In the RM method, \(\delta _j(x)\) is equal to the normalized sum of the ranks of the objects belonging to the jth class in the set \(\mathcal {S}_K(x)\). Let \(r(x_k)\) be the rank of the validation object \(x_k \in \mathcal {S}_K(x)\), where the nearest neighbor has rank K and the furthest neighbor has rank 1. Then

$$\begin{aligned} r_j(x) = \sum _{x_k \in \mathcal {S}_K(x): j_k=j} r(x_k) \end{aligned}$$
(6)

is the sum of the ranks of the validation objects from the K-neighborhood of x belonging to the jth class, and consequently

$$\begin{aligned} \delta _j(x) = \frac{r_j(x)}{\sum _{k=1}^{K} k}. \end{aligned}$$
(7)
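A sketch of (6)-(7) under the same assumptions; `dists_neigh` (an illustrative name) holds the distances of the K neighbors to x:

```python
import numpy as np

def decision_profile_rm(dists_neigh, y_neigh, M):
    """Ranking-based decision profile, Eqs. (6)-(7):
    the nearest neighbor gets rank K, the furthest rank 1."""
    dists_neigh, y_neigh = np.asarray(dists_neigh), np.asarray(y_neigh)
    K = len(y_neigh)
    ranks = np.empty(K)
    ranks[np.argsort(dists_neigh)] = np.arange(K, 0, -1)  # ranks K, K-1, ..., 1
    r = np.array([ranks[y_neigh == j].sum() for j in range(1, M + 1)])
    return r / (K * (K + 1) / 2)  # normalize by 1 + 2 + ... + K
```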

The Potential Function Method (PM)

Let \(H(x,x_k)\) be a non-negative potential function [16], decreasing with increasing distance between x and \(x_k\). In this study, an exponential potential function of the Euclidean distance is used:

$$\begin{aligned} H(x,x_k) = \exp (- ||x - x_k||). \end{aligned}$$
(8)

Then, we can calculate \(\delta _j(x)\) as the normalized sum of the potentials (8) of the objects from the set \(\mathcal {S}_K(x)\) belonging to the jth class, namely:

$$\begin{aligned} \delta _j(x) = \frac{\sum _{x_k \in \mathcal {S}_K(x):\, j_k=j} \exp (- ||x - x_k||)}{\sum _{x_k \in \mathcal {S}_K(x)} \exp (- ||x - x_k||)}. \end{aligned}$$
(9)
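A corresponding sketch of (8)-(9), with the per-class sums of potentials normalized by the total potential of the whole K-neighborhood:

```python
import numpy as np

def decision_profile_pm(dists_neigh, y_neigh, M):
    """Potential-function decision profile, Eqs. (8)-(9),
    with H(x, x_k) = exp(-||x - x_k||)."""
    H = np.exp(-np.asarray(dists_neigh))  # potentials of the K neighbors
    y_neigh = np.asarray(y_neigh)
    per_class = np.array([H[y_neigh == j].sum() for j in range(1, M + 1)])
    return per_class / H.sum()
```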

Distance Between Decision Profile and Vector of Supports. In order to evaluate \(\psi _l\) at x and determine its competence \(c(\psi _l|x)\), we compare the decision profile \(\delta (x)\) with the vector of supports \(d_l(x)\) by calculating a distance \(dist[\delta (x), d_l(x)]\). The competence measure is a normalized function of this distance, decreasing as the distance between \(\delta (x)\) and \(d_l(x)\) increases. In particular, \(c(\psi _l|x)\) is equal to 1 (0) if the distance is equal to 0 (attains its greatest value).

We propose three different methods for calculating the distance \(dist[\delta (x), d_l(x)]\) and the resulting measures of competence.

Euclidean Distance (ED)

We adopt the Euclidean distance

$$\begin{aligned} dist[\delta (x),d_l(x)] = ||\delta (x) - d_l(x)|| \end{aligned}$$
(10)

and hence we get

$$\begin{aligned} c(\psi _l|x) = \frac{\sqrt{2} - ||\delta (x) - d_l(x)||}{\sqrt{2}}. \end{aligned}$$
(11)
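In code (a sketch; `delta` and `d` are the decision profile and the support vector, both non-negative and summing to one):

```python
import numpy as np

def competence_ed(delta, d):
    """Euclidean-distance competence, Eqs. (10)-(11).
    Two normalized support vectors are at most sqrt(2) apart."""
    dist = np.linalg.norm(np.asarray(delta) - np.asarray(d))
    return (np.sqrt(2.0) - dist) / np.sqrt(2.0)
```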

Max-Max Distance (MD)

Let j be the class number for which the classifier \(\psi _l\) produced the greatest support value at the point x (i.e. \(d_{lj}(x) = \max _{k \in \mathcal {M}} d_{lk}(x)\)). Similarly, let i be the class number with the greatest value in the decision profile \(\delta (x)\). Then the max-max distance is defined as:

$$\begin{aligned} dist[\delta (x),d_l(x)] = |d_{lj}(x) -\delta _j(x)| + |d_{li}(x) -\delta _i(x)|. \end{aligned}$$
(12)

Hence we have the following formula for competence measure:

$$\begin{aligned} c(\psi _l|x) = \frac{2 - |d_{lj}(x) -\delta _j(x)| - |d_{li}(x) -\delta _i(x)|}{2}. \end{aligned}$$
(13)
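A minimal sketch of (12)-(13):

```python
import numpy as np

def competence_md(delta, d):
    """Max-max-distance competence, Eqs. (12)-(13)."""
    delta, d = np.asarray(delta), np.asarray(d)
    j = np.argmax(d)      # class with the greatest support value
    i = np.argmax(delta)  # class with the greatest profile value
    dist = abs(d[j] - delta[j]) + abs(d[i] - delta[i])
    return (2.0 - dist) / 2.0
```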

Hamming Distance (HD)

Let \(h(\psi _l(x))=[j^{'}_1,j^{'}_2, \ldots ,j^{'}_M]\) and \(h(x)=[j^{''}_1,j^{''}_2, \ldots ,j^{''}_M]\) be the vectors of class numbers ordered according to the decreasing values of the supports produced by \(\psi _l\) at x and of the decision profile of x, respectively. The distance between \(d_{l}(x)\) and \(\delta (x)\) is defined as the Hamming distance between the vectors \(h(\psi _l(x))\) and h(x), namely

$$\begin{aligned} dist[\delta (x),d_l(x)] = D_H[h(\psi _l(x)), h(x)]. \end{aligned}$$
(14)

Hence we get the following form of the competence measure:

$$\begin{aligned} c(\psi _l|x) = \frac{M - D_H[h(\psi _l(x)), h(x)]}{M}. \end{aligned}$$
(15)
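A sketch of (14)-(15); ties in the orderings are broken by index here, which the paper does not specify:

```python
import numpy as np

def competence_hd(delta, d):
    """Hamming-distance competence, Eqs. (14)-(15): compare the class
    orderings induced by the supports and by the decision profile."""
    h_psi = np.argsort(-np.asarray(d))      # classes by decreasing support
    h_x = np.argsort(-np.asarray(delta))    # classes by decreasing profile value
    d_h = int(np.sum(h_psi != h_x))         # number of positions where the orders differ
    return (len(d) - d_h) / len(d)
```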

Example. Consider a classification problem with three classes (\(M=3\)). Figure 1 presents the 6-neighborhood of an object x in a two-dimensional feature space. An additional unit grid helps to determine the distances between objects. Suppose that the classifier \(\psi \) produced the supports \(d_1(x)=0.3\), \(d_2(x)=0.6\) and \(d_3(x)=0.1\). Our purpose is to determine the competence \(c(\psi |x)\) of the classifier \(\psi \) at the point x using the presented methods.

Fig. 1. Illustration of the example: the 6-neighborhood of an object x.

From Fig. 1 we simply get the Euclidean distances between x and validation objects:

$$\begin{aligned} ||x-x_1||=2, \;||x-x_2||=5, \;||x-x_3||=2.83, \end{aligned}$$
$$\begin{aligned} ||x-x_4||=2.23, \;||x-x_5||=3, \;||x-x_6||=3.6. \end{aligned}$$

First, we calculate decision profiles for the proposed methods.

FM method:

$$\begin{aligned} M_1^{(6)}=2, \; M_2^{(6)}=3, \; M_3^{(6)}=1 \; \text {and hence}\; \delta _1(x)=1/3, \; \delta _2(x)=1/2, \; \delta _3(x)=1/6. \end{aligned}$$
(16)

RM method:

$$\begin{aligned} r_1(x)=10, \; r_2(x)=9, \;r_3(x)=2 \end{aligned}$$

and hence

$$\begin{aligned} \delta _1(x)=10/21, \; \delta _2(x)=9/21, \; \delta _3(x)=2/21. \end{aligned}$$
(17)

PM method:

$$\begin{aligned} H(x,x_1)+H(x,x_3)=0.135+0.059=0.194, \quad H(x,x_6)=0.027, \end{aligned}$$
$$\begin{aligned} H(x,x_2)+H(x,x_4)+H(x,x_5)=0.006+0.108+0.049=0.163 \end{aligned}$$

and hence

$$\begin{aligned} \delta _1(x)=\frac{0.194}{0.384}=0.505, \ \delta _2(x)=\frac{0.163}{0.384}=0.424, \ \delta _3(x)=\frac{0.027}{0.384}=0.071. \end{aligned}$$
(18)

Now, using formulas (10)–(15) and the calculated decision profiles (16), (17) and (18), we can calculate the competence \(c(\psi |x)\) of the classifier \(\psi \) at the point x. The results for all combinations of the decision-profile methods and the distance measures between the decision profile and the vector of supports are presented in Table 1.

Table 1. Results of example.
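The numbers above can be reproduced with the sketches given earlier; the neighbor labels (x_1, x_3 in class 1; x_2, x_4, x_5 in class 2; x_6 in class 3) are read off Fig. 1:

```python
import numpy as np

# Distances and labels of the 6-neighborhood (read off Fig. 1),
# and the supports produced by psi.
dists = np.array([2.0, 5.0, 2.83, 2.23, 3.0, 3.6])  # x_1, ..., x_6
labels = np.array([1, 2, 1, 2, 2, 3])
d = np.array([0.3, 0.6, 0.1])

# FM profile, Eq. (16): class counts divided by K = 6
delta_fm = np.bincount(labels, minlength=4)[1:] / 6.0   # [0.333 0.500 0.167]

# RM profile, Eq. (17): nearest neighbor gets rank 6, furthest rank 1
ranks = np.empty(6)
ranks[np.argsort(dists)] = np.arange(6, 0, -1)
r = np.array([ranks[labels == j].sum() for j in (1, 2, 3)])  # [10. 9. 2.]
delta_rm = r / 21.0                                          # [0.476 0.429 0.095]

# PM profile, Eq. (18), with H(x, x_k) = exp(-||x - x_k||)
H = np.exp(-dists)
delta_pm = np.array([H[labels == j].sum() for j in (1, 2, 3)]) / H.sum()
# ~[0.504 0.425 0.071], matching (18) up to rounding
```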

2.3 DES Systems

The proposed measures of competence can be applied in the selection/fusion algorithm of any multiclassifier system, provided that the feature space \(\mathcal {X}\) is a metric space. In this subsection we describe two multiclassifier systems based on the DES strategy.

Multiclassifier System with Fusion at the Decision Level (MC1). In this system, first a subset \(\varPsi ^{*}(x)\) of base classifiers with competence greater than that of random guessing is selected for a given x:

$$\begin{aligned} \varPsi ^{*}(x)=\{\psi _{l_1},\psi _{l_2}, \ldots ,\psi _{l_T}\},\ \text {where} \ \ c(\psi _{l_t}|x) > 1/M \ \text { for } \ t=1,2,\ldots ,T. \end{aligned}$$
(19)

The selected classifiers are combined using the weighted majority voting rule where the weights are equal to the competences. This fusion method leads to the following class supports (\(j=1,2,\ldots M\)):

$$\begin{aligned} d^{(MC1)}_j(x) = \sum ^{T}_{t=1} c(\psi _{l_t}|x) \left[ \psi _{l_t}(x)=j \right] , \end{aligned}$$
(20)

where \(\left[ \cdot \right] \) denotes the Iverson bracket, equal to 1 if the condition in the bracket is true and 0 otherwise.

The MC1 system \(\psi ^{(MC1)}\) classifies x using the maximum rule:

$$\begin{aligned} \psi ^{(MC1)}(x)=i \Leftrightarrow d_{i}^{(MC1)}(x) = \max _{j \in \mathcal {M}} d_{j}^{(MC1)}(x). \end{aligned}$$
(21)

Multiclassifier System with Fusion at the Continuous-Value Level (MC2). The MC2 system is identical to the MC1 system except that the selected classifiers (19) are combined at the continuous-value level (\(j=1,2,\ldots M\)):

$$\begin{aligned} d^{(MC2)}_j(x) = \sum ^{T}_{t=1} c(\psi _{l_t}|x)\, d_{l_t,j}(x). \end{aligned}$$
(22)

The final decision, as previously, is made according to the maximum rule:

$$\begin{aligned} \psi ^{(MC2)}(x)=i \Leftrightarrow d_{i}^{(MC2)}(x) = \max _{j \in \mathcal {M}} d_{j}^{(MC2)}(x). \end{aligned}$$
(23)
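Both systems can be sketched together; `supports` collects the base-classifier support vectors at x (one row per classifier) and `competences` the values \(c(\psi _l|x)\). The fallback to the full pool when no classifier exceeds 1/M is an assumption of this sketch, not a rule from the paper:

```python
import numpy as np

def des_classify(supports, competences, level="continuous"):
    """MC1/MC2 classification, Eqs. (19)-(23).

    supports    : (L, M) array of base-classifier support vectors at x
    competences : (L,) array of c(psi_l | x)
    Returns the predicted class label in 1..M.
    """
    L, M = supports.shape
    selected = competences > 1.0 / M          # selection rule, Eq. (19)
    if not selected.any():                    # assumption: fall back to the full pool
        selected = np.ones(L, dtype=bool)
    c, s = competences[selected], supports[selected]
    if level == "decision":                   # MC1, Eq. (20): competence-weighted votes
        votes = np.argmax(s, axis=1)
        scores = np.array([c[votes == j].sum() for j in range(M)])
    else:                                     # MC2, Eq. (22): competence-weighted supports
        scores = c @ s
    return int(np.argmax(scores)) + 1         # maximum rule, Eqs. (21)/(23)
```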

The MC2 system with the developed competence measures is applied in the experimental investigations.

3 Experiments

3.1 Experimental Setup

The performance of the developed MC systems was evaluated in experiments using 15 benchmark data sets. In the first experiment, the MC2 system was evaluated using the different methods for calculating the decision profile (FM, RM and PM) and the different distances between the decision profile and the support vector (ED, MD and HD), and the best-performing combinations were identified. In the second experiment, the method identified was compared with other competence-based MC systems. The experiments were conducted in MATLAB using PRTools 4.1 [8]. In both experiments, the value \(K=5 \times M\) (where M denotes the number of classes) was used as the neighborhood size.

The 15 benchmark data sets were taken from the UCI Machine Learning Repository [1]. We selected the same data sets that were used in the experimental investigations presented in [21]. A brief description of the data sets is given in Table 2.

Table 2. The data sets used in the experiments.

For each data set, feature vectors were normalized to zero mean and unit standard deviation. Two-fold cross-validation was used to extract training and test sets from each data set. For the calculation of the competences, a two-fold stacked generalization method was used [24]. In this method, the training set is split into two sets A and B of roughly equal size. Set A is first used for training the classifiers in the ensemble, while set B is used for the calculation of the competences. Then set B is used for training, while the competences are calculated using set A. Finally, the competences calculated for both sets are stacked together and the classifiers in the ensemble are trained on the union of sets A and B (i.e. the original training set). In this way, the competences of the classifiers are calculated for all the feature vectors in the original training set, yet the data used for the calculation are unseen during classifier training.
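A sketch of this procedure; `train_pool` and `calc_competences` are placeholders for the ensemble-training and competence-calculation routines, not names from the paper:

```python
import numpy as np

def two_fold_stacked(X, y, train_pool, calc_competences, seed=0):
    """Two-fold stacked generalization [24]: every training point gets
    competences computed by classifiers that did not see it."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    A, B = idx[: len(X) // 2], idx[len(X) // 2:]
    comp = {}
    for train_idx, eval_idx in ((A, B), (B, A)):
        pool = train_pool(X[train_idx], y[train_idx])   # train on one half
        for i, c in zip(eval_idx, calc_competences(pool, X[eval_idx], y[eval_idx])):
            comp[int(i)] = c                            # competences on the other half
    final_pool = train_pool(X, y)                       # retrain on the full training set
    return final_pool, np.array([comp[i] for i in range(len(X))])
```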

The experiments were conducted using two ensemble types: homogeneous and heterogeneous. The homogeneous ensemble consisted of 50 feed-forward backpropagation neural network classifiers, each with one hidden layer and the maximum number of learning epochs set to 80. Each neural network classifier was trained using a randomly selected 70% of the objects from the training data set. The heterogeneous ensemble consisted of the following 11 base classifiers [7]:

  • (1) linear classifier based on normal distribution with the same covariance matrix for each class;

  • (2) quadratic classifier based on normal distribution with different covariance matrix for each class;

  • (3) nearest mean classifier;

  • (4–6) k-nearest neighbors classifiers with k = 1, 5, 10;

  • (7, 8) Parzen density-based classifiers with the Gaussian kernel and the optimal smoothing parameter \(h_{opt}\) (and the smoothing parameter \(h_{opt}/2\));

  • (9) pruned decision tree classifier with the Gini splitting criterion;

  • (10, 11) feed-forward backpropagation neural network classifiers containing one hidden layer with 10 neurons (two hidden layers with 5 neurons each) and the maximum number of learning epochs set to 80.

The performance of the constructed systems was compared with that of the following seven MC systems:

  1. Overall local accuracy method (OLA1) [23]. The competence at a test point x is calculated as the percentage of correct recognitions of the K nearest validation samples of x;

  2. Local class accuracy method (LCA) [23]. The competence is estimated for each base classifier as the percentage of correct classifications within the local region (the K-neighborhood), considering only those examples which belong to the class assigned by the classifier to the unknown pattern;

  3. Overall local accuracy method (OLA2) [19]. The competence is calculated as in the OLA1 approach, but the validation objects from the K-neighborhood are additionally weighted by their Euclidean distances to the unknown object x;

  4. Multiple classifier behavior method (MCB) [10]. The competence is calculated using a similarity function measuring the degree of similarity of the output profiles of all base classifiers;

  5. Oracle KNORA-Eliminate method (ORE) [12]. All classifiers are selected that correctly classify all samples in the local region (the K-neighborhood). If no classifier is selected, the local region is reduced until at least one classifier is selected;

  6. Randomized reference classifier method (RRC) [21]. The competence of a base classifier is calculated as the probability of correct classification of a randomized reference classifier (RRC) which, on average, acts like the base classifier being modeled;

  7. Random guessing based method (RGM) [22]. The competence is calculated relative to random guessing: a classifier is considered competent (incompetent) if it is more (less) accurate than the random classifier.

3.2 Results and Discussion

Classification accuracies were averaged over 5 repetitions of two-fold cross-validation. Statistical differences in rank between the systems were assessed using the Friedman test with the Iman–Davenport correction, combined with the post hoc Holm step-down procedure [5]. The average ranks of the systems and the critical rank difference calculated using the Bonferroni–Dunn test are visualised. A level of \(p<0.05\) was considered statistically significant.

The average ranks obtained in the first experiment for the nine proposed methods, for the homogeneous and heterogeneous ensembles, are presented in Figs. 2A and B, respectively. Using the potential function method for calculating the decision profile together with the max-max distance between the decision profile and the support vector (PM-MD) resulted in the best average rank regardless of the ensemble type. The average rank of the PM-MD method is significantly better than the average ranks of the FM-MD, RM-MD, PM-HD, FM-HD and RM-HD methods. The methods with the Hamming distance achieved the worst average ranks regardless of the method for calculating the decision profile and the ensemble type used. Consequently, the PM-MD method was selected for the second experiment.

Fig. 2. Average ranks of the MC2 systems for different methods of calculating the decision profile and different distances between the decision profile and the support vector, for the homogeneous (A) and heterogeneous (B) ensemble of base classifiers. The interval (thick line) is the critical rank difference (2.991) calculated using the Bonferroni–Dunn test (\(p<0.05\)).

Fig. 3. Average ranks of the MC systems compared, for the homogeneous (A) and heterogeneous (B) ensemble of base classifiers. The interval (thick line) is the critical rank difference (2.394) calculated using the Bonferroni–Dunn test (\(p<0.05\)).

Table 3. Classification accuracies (in percent) and average ranks of the PM-MD system and the seven MCSs for the homogeneous ensemble. The best result for each data set is in bold.
Table 4. Classification accuracies (in percent) and average ranks of the PM-MD system and the seven MCSs for the heterogeneous ensemble. The best result for each data set is in bold.

The results obtained in the second experiment for the PM-MD method and the seven MC systems, for the homogeneous and heterogeneous ensembles, are presented in Tables 3 and 4 and in Figs. 3A and B, respectively. The constructed system achieved the second-best average rank for both types of classifier ensemble. The average rank of the PM-MD method is significantly better than the average ranks of the LCA, RGM, OLA1 and OLA2 methods, regardless of the ensemble type used.

4 Conclusion

Nowadays, much research has focused on MC systems and, consequently, many new solutions have been dedicated to each of the two main approaches: classifier fusion and classifier selection. In these solutions, a fundamental role is played by the assessment of the competence of the base classifiers, which is crucial both in the DES scheme and in the combining of base classifiers. In this paper a new method for calculating the competence of a classifier in the feature space was presented. In the proposed method, first the K-neighborhood is used to determine the so-called decision profile of a test object. The decision profile is an evaluation of the chance that the recognized object belongs to the particular classes. Next, the decision profile is compared with the response produced by the classifier, and the competence is calculated according to the similarity rule. MC systems with the DES scheme using the proposed competence measure were developed and experimentally evaluated on 15 benchmark datasets. The experimental results showed that calculating the competence of a classifier by comparing its response with the decision profile of the classified object is a sound approach and leads to accurate and efficient multiclassifier systems.