1 Introduction

It is widely accepted that ensemble classifiers are more accurate than a single classifier (Zhang et al. 2006). A number of effective ensemble methods have been developed over the previous decades, e.g. bagging (Breiman 1996) and boosting (Freund and Schapire 1996). There is no unified theory behind ensemble methods, and several authors have stated that no consistent and theoretically sound explanation of ensemble classifiers is available (Kuncheva 2004). Despite these negative reflections, at least three theories can explain the effectiveness of ensemble learners. The first is based on large margin classifiers (Mason et al. 2000): ensemble classifiers enlarge the margins and thereby improve the generalization performance of output coding (Allwein et al. 2000), an argument rooted in Vapnik's Statistical Learning Theory (Vapnik 1998). The second theoretical argument rests on the bias-variance decomposition of error, which states that ensemble classifiers reduce the variance, or both the bias and the variance (Breiman 1998; Kong and Dietterich 1995; Schapire 1999). The last general theory is built on a set-theoretic point of view that removes all algorithmic details of classifiers and training procedures, treating classifiers as sets of points (Kleinberg 1996, 2000).

The effectiveness of ensemble methods relies on the diversity of the classifiers and on accurate learning models. Despite their effectiveness, they may require extensive memory to store all the learning models, and obtaining a prediction on unlabeled test data can be time consuming. For small data sets these two costs may be negligible, but they can become prohibitive when ensemble methods are applied to large scale data sets. Furthermore, the size of the ensemble may lead to inefficiency, and selecting a subset of classifiers can improve the generalization performance (Zhang et al. 2006).

Several pruning algorithms have been developed, such as ranking the classifiers according to their individual performance and picking the best ones. The major difficulty in selecting the best subset of the ensemble arises when optimizing some criterion over subsets, since this turns into a greedy search with no theoretical or empirical quality guarantee (Zhang et al. 2006). In Zor et al. (2014), a novel approach is proposed to investigate a given base classifier's effectiveness by measuring its accuracy \(k\) times with respect to each individual class of a \(k\)-class problem and averaging the results. The proposed measure is used to prune the ensemble by ordered aggregation and is referred to as ACEC. Since pruning is performed by ordered aggregation, the running time of ACEC increases as the size of the ECOC matrix increases. As in other ordered aggregation methods, the pruning size must be given as an input, which makes the method of Zor et al. (2014) parameter dependent. Evolutionary pruning methods for ensemble selection are developed in Kim et al. (2002). Unlike the heuristic methods mentioned above, ensemble pruning may be formulated as a quadratic integer programming problem and solved by semi-definite programming, which searches for an optimal diversity-accuracy trade-off (Zhang et al. 2006). This approach differs from previous ensemble methods as it consists of discrete variables and can be considered the discrete version of weighted ensemble optimization. The optimization problem contains a parameter \(k\), the ensemble size of the selected subset. Since it must be chosen beforehand, the solution, and hence the accuracy of the ensemble, also depends on this parameter.

A similar optimization-based approach is proposed in Yin et al. (2014a), which considers accuracy and diversity measures as in Zhang et al. (2006) through a continuous optimization model with sparsity learning. The objective function in Yin et al. (2014a) is defined by a loss function, chosen specifically to be the least squares function. The weights of the classifiers are forced to be sparse by an \(L_1\) norm regularization within the constraints, which determines the subset of the ensemble, and the weights themselves are obtained by minimizing the error rate through the least squares objective. Diversity is taken into account with an additional diversity function defined by Yule's Q statistic within the constraints of the optimization problem. Since the model in Yin et al. (2014a) is defined by a least squares function that contains the true label vectors, it cannot be applied to the ECOC framework directly, because the columns of an ECOC matrix define different problems with different relabellings. Yin et al. (2014b) further improved this work by penalizing the sparsity and diversity constraints within the objective function, where the error is again minimized via the least squares function.

In this paper, unlike the pruning methods of Zor et al. (2014) and Zhang et al. (2006), the pruning size is not required as an input to the problem. Our first contribution is the reformulation of the quadratic integer program of Zhang et al. (2006) as an unconstrained problem by approximating the cardinality constraint, which encodes the ensemble size, through a zero-norm approximation. The problem thus becomes parameter-free with respect to the ensemble size, at the cost of turning into a nonconvex continuous optimization problem. This nonconvex problem is reformulated as a difference of convex functions, as in Sriperumbudur et al. (2011), and solved with the nonlinear programming solver fminunc from MATLAB's Optimization Toolbox (Branch and Grace 2002). The second novel contribution of this study is the adaptation of the overall ensemble pruning problem to ECOC. The performance of our proposed approach, ECOC with UP, is compared with the well-known methods Reduced Error Pruning, Random Guessing and Kappa Pruning, and with the recent approaches ACEC of Zor et al. (2014) and SDP of Zhang et al. (2006).

Unlike the methods mentioned above, the subclass technique (subECOC) is developed in Bouzas et al. (2011) on the ECOC framework, where splitting the initial classes of the problem produces larger but easier-to-solve ECOC configurations. The decomposition of the multiclass problem is performed by a discriminant tree approach.

Our proposed method shares common goals and an optimization structure with the method in Yin et al. (2014a, b) through its diversity and sparsity notions. However, the expression of accuracy, diversity and sparsity differs in the respective objective functions and constraints. Furthermore, our method is developed to prune the ECOC matrix. One important aspect of our algorithm is that it can be applied to the ECOC framework, whereas the objective function in Yin et al. (2014a, b) cannot be adapted to ECOC because of its loss function term. It should be noted that the ECOC columns represent different labelings, which constitute different binary problems and hence are incompatible with the loss term in Yin et al. (2014a, b).

The proposed framework is novel, since optimising the accuracy/diversity trade-off for ECOC through pruning by means of optimization has not previously been attempted. The closest approach is Zhang et al. (2006), but that is restricted to two-class problems and is a discrete optimization. Furthermore, our approach automatically determines the optimal pruning rate as part of the optimisation, so it does not need to be supplied as a parameter in advance. In Zor et al. (2014), ECOC is pruned by ordered aggregation, which achieves high accuracy but becomes slow as the number of examples and the number of classes increase. The approach proposed in this work differs from the method in Zor et al. (2014) in that pruning is modeled by a continuous optimization framework which gives the exact number of classifiers in the subensemble, unlike the ordered aggregation of Zor et al. (2014).

The rest of the paper is organized as follows. In Sect. 2, ECOC and the accuracy/diversity trade-off are reviewed. Section 3 describes our method with its mathematical formulation and solution technique. Section 4 is devoted to the experiments, and the paper concludes with a discussion in Sect. 5.

2 Error Correcting Output Codes

The Error Correcting Output Codes (ECOC) (Dietterich and Bakiri 1995) framework is a general method for multiclass classification by embedding binary classifiers. Given a set of \(k\) classes, ECOC generates a coding matrix \(M\) of size \(k\times n\) in which each row corresponds to the codeword of a class, i.e., the ith row is the codeword of length \(n\) for the ith class. Each codeword consists of \(\{-1,+1\}\) binary entries. In terms of learning, \(M\) is constructed by considering \(n\) binary problems, each corresponding to a column of \(M\). Each of these binary problems (or dichotomizers) splits the multiclass problem into two classes coded by \(-1\) or \(+1\) (or \(0\) if the class is not considered) in \(M\). At the decoding step, each trained classifier produces a binary output on the test set, and these outputs form a codeword for the test point. The class of the test point is determined by finding the minimal distance between the codeword of the test point and the class codewords in \(M\); the data point is assigned to the closest codeword in \(M\). There are different decoding strategies in the literature; for a closer look, see Escalera et al. (2010). In this paper, we use the Hamming distance as the distance measure between codewords. In Fig. 1, an example of ECOC coding/decoding is given.

Fig. 1 An example of the ECOC framework with Support Vector Machine (SVM) base classifiers
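To make the decoding step concrete, the following minimal Python sketch (an illustration on our part, not code from the original implementation) assigns a test point to the class whose codeword in \(M\) is closest in Hamming distance; the dichotomizers are assumed to be scikit-learn-style binary classifiers trained with \(\pm 1\) labels, one per column of \(M\).

```python
import numpy as np

def hamming_decode(codeword, M):
    """Return the index of the row of M closest to `codeword` in Hamming distance.

    codeword: length-n array with entries in {-1, +1}
    M:        (k, n) coding matrix with entries in {-1, +1}
    """
    distances = np.sum(M != np.asarray(codeword), axis=1)  # disagreeing bits per class
    return int(np.argmin(distances))

def ecoc_predict(X_test, dichotomizers, M):
    """Decode test points: each trained dichotomizer contributes one bit of the
    codeword, and the class with the nearest codeword in M is returned."""
    bits = np.column_stack([clf.predict(X_test) for clf in dichotomizers])
    return np.array([hamming_decode(row, M) for row in bits])
```

In the setting of Fig. 1, each SVM dichotomizer would supply one entry of the test codeword before decoding.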

The ECOC framework is independent of the base classifiers and has been shown to reduce the bias and variance produced by the learning algorithms, as mentioned in Kong and Dietterich (1995). For these reasons, ECOC has been widely used for multiclass classification problems.

2.1 Accuracy/diversity trade-off

Diversity has long been recognised as a necessary condition for improving ensemble performance, since if base classifiers are highly correlated, it is not possible to gain much by combining them. The individual classifiers may be very accurate, and indeed the more accurate they become, the less diversity exists between them (Windeatt 2006). This dilemma has become known as the accuracy/diversity trade-off. There have been many attempts to define diversity (Kuncheva and Whitaker 2003), but the consensus is that no matter how it is defined, there does not exist any measure that can by itself predict generalisation error of an ensemble. Since there is a lack of a general theory on how diversity impacts ensemble performance, experimental studies continue to be an important contribution to discovering whether a relationship exists and if so whether it can be quantified and understood.

For solving multiclass problems using ECOC, considerations of the accuracy/diversity trade-off are even more complex. The original motivation for ECOC was based on error-correcting theory, which assumes that errors are independent. However, when applied to multiclass machine learning problems, the error correlation depends on the data set and base classifier as well as on the code matrix. In the original ECOC approach (Dietterich and Bakiri 1995), heuristics were employed to maximise the distance between the columns to reduce error correlation. There have been other attempts to address error correlation. Hadamard matrices maximise the distance between rows and columns and were used as ECOC code matrices in Guruswami and Sahai (1999), in which an upper bound on the probability of error correlation was derived as a function of the minimum distance between code words. In Allwein et al. (2000) it was shown that a high minimum distance between any pair implies a reduced upper bound on the generalisation error, and in James (1998) it was shown for a random matrix that if the code is equi-distant and long enough, then decision-making is Bayes-optimal. More recent approaches aim to introduce problem-dependent designs to address error correlation (Escalera et al. 2010).

Based on previous research, it is possible to summarise the main considerations in designing ECOC matrices:

  • minimum Hamming Distance between rows (error-correcting capability)

  • variation of Hamming Distance between rows (effectiveness of decoding)

  • number of columns (repetition of different parts of sub-problems)

  • Hamming Distance between columns and complement of columns (independence of base classifiers).

All these constraints make optimal design of coding and decoding strategies a complex problem. Previous studies have attempted to address some of these constraints experimentally. An accuracy/diversity study was carried out in Windeatt (2006) using a random code matrix with near equal split of classes (approximately equal number of 1’s in each column), as proposed in Schapire (1997). It is proved in Windeatt and Ghaderi (2003) that optimal performance is attained if codes are equi-distant, and an experimental comparison is made of random, equi-distant and non-equi-distant code matrices. However, the proposed approach in this paper is the first to incorporate pruning into ECOC and address the optimisation of the accuracy/diversity trade-off in a principled fashion.
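For reference, the sketch below generates a random code matrix with a near-equal split of classes in each column, in the spirit of Schapire (1997) and of the random matrices used in Sect. 4; the rejection of duplicate and complemented columns is our own simplification, not a detail from the paper.

```python
import numpy as np

def random_balanced_ecoc(k, n, seed=None):
    """Generate a (k, n) coding matrix with entries in {-1, +1} whose columns have a
    near-equal split of classes. Columns equal to (or the complement of) an already
    chosen column are rejected; constant columns cannot occur with this construction.
    At most 2**(k-1) - 1 distinct dichotomies exist, so n must not exceed that bound."""
    rng = np.random.default_rng(seed)
    columns, seen = [], set()
    while len(columns) < n:
        col = np.ones(k, dtype=int)
        col[rng.permutation(k)[: k // 2]] = -1            # roughly half -1s and half +1s
        key = tuple(col) if col[0] == 1 else tuple(-col)  # a column and its complement share a key
        if key in seen:
            continue
        seen.add(key)
        columns.append(col)
    return np.column_stack(columns)
```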

3 Pruning methods

In this section, we first describe a family of pruning methods based on ordered aggregation. The order of aggregation is determined according to different measures, such as voting, accuracy and diversity: the initial ensemble starts with the classifier that is optimal with respect to the chosen measure, and new candidate classifiers are added iteratively based on the diversity/accuracy of the ensemble on the training set \(Z_{training}\).

As discussed in Sect. 2.1, a single measure of accuracy or diversity alone is not a sufficient criterion for pruning ensembles. Both accuracy and diversity must be considered together when selecting the classifiers in the ensemble. Some rules that change the order of aggregation are the Reduced Error Pruning Method (REP) (Margineantu and Dietterich 1997), Kappa Pruning (KP) (Margineantu and Dietterich 1997), the Complementarity Measure (Martínez-Muñoz and Suárez 2004), Margin Distance Minimization (MDSQ) (Martínez-Muñoz and Suárez 2004), Orientation Ordering (Martínez-Muñoz and Suárez 2006) and Boosting Based Pruning (Martínez-Muñoz and Suárez 2007). In this section, we describe the idea of the first two methods, REP and KP, and then continue with our method developed for ECOC.

3.1 Reduced Error Pruning (REP)

This method is first introduced in Margineantu and Dietterich (1997). The result of combining the predictions of the classifiers in an ensemble \(E_{T}= \{h_t (x)\}_{t=1}^{T}\) using equally weighted voting is

$$\begin{aligned} H_{E_T}(x) = \arg \max _{y} \sum _{t=1}^{T} I\left( h_t(x)=y\right) , \quad y\in Y \end{aligned}$$

where \(h_t\) is the hypothesis of the \(t\)th classifier and \(Y\) is the set of labels. The first classifier in the ensemble is the one with the minimum classification error. The algorithm then searches for the second classifier that minimizes the classification error of the new ensemble. It continues to add the remaining classifiers one by one, at each step incorporating the one that gives the lowest ensemble error, until the desired ensemble size is reached. The subensemble \(S_u\) is constructed by adding to \(S_{u-1}\) the classifier

$$\begin{aligned} s_u = \arg \max _{k} \sum _{x,y \in Z_{training}} I\left( H_{S_{u-1}\cup h_k} (x) =y\right) ,\quad k \in E_{T}\setminus S_{u-1}, \end{aligned}$$

where \(k\) runs over all classifiers that have not been selected up to that iteration, \(y\in Y\), and \(I\) is the indicator function.

In this paper, we adapted the REP algorithm to ECOC in the same manner. SVM with a Gaussian kernel is used as the base classifier for the ECOC learners. Each base classifier is tuned by 5-fold cross-validation, and REP is applied on 10 different random folds, each fold having its own ensemble. The size of the subensemble is chosen to be the same as the size of the subensemble of the pruned ECOC matrix proposed in this study.
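A compact sketch of the REP loop described above is given below, assuming the per-classifier label predictions on \(Z_{training}\) have been precomputed and that class labels are coded as integers \(0,\ldots ,k-1\); the tie-breaking behaviour is an illustrative choice on our part.

```python
import numpy as np

def majority_vote(P, n_classes):
    """Equally weighted voting over the columns of P (integer-coded labels)."""
    return np.array([np.argmax(np.bincount(row, minlength=n_classes)) for row in P])

def reduced_error_pruning(preds, y, target_size):
    """Greedy REP: preds[i, t] is the label predicted by classifier t on training
    point i, y holds the true labels. The subensemble grows one classifier at a
    time so that majority-vote accuracy is maximised at each step."""
    N, T = preds.shape
    n_classes = int(max(preds.max(), y.max())) + 1

    def vote_accuracy(idx):
        return np.mean(majority_vote(preds[:, idx], n_classes) == y)

    # start with the individually most accurate classifier
    selected = [int(np.argmax([(preds[:, t] == y).mean() for t in range(T)]))]
    while len(selected) < target_size:
        remaining = [t for t in range(T) if t not in selected]
        gains = [vote_accuracy(selected + [t]) for t in remaining]
        selected.append(remaining[int(np.argmax(gains))])
    return selected
```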

3.2 Kappa Pruning (KP)

This method is based on selecting the most diverse classifiers using the \(\kappa \) statistic (Kuncheva and Whitaker 2003). As in REP, Kappa Pruning iteratively adds to the ensemble the classifier that gives the minimum pairwise \(\kappa \) measure. It starts with the pair of classifiers with the minimum pairwise \(\kappa \) diversity, and then adds the classifier that minimizes the mean of the pairwise \(\kappa \) values in the ensemble. The selection formula is analogous to that of REP, differing only in the use of the kappa measure

$$\begin{aligned} s_u = \arg \min _{k} \kappa _{Z_{training}} \left( h_k,H_{S_{u-1}}\right) , \quad k \in E_{T}\setminus S_{u-1}, \end{aligned}$$

where \(\kappa \) is the pairwise diversity measure given by \( \kappa = \frac{N^{00}}{N^{11} + N^{10} + N^{01} + N^{00}}\) in Kuncheva and Whitaker (2003). Here \(N^{ab}\) is the number of training examples for which the output of classifier \(i\) is \(a\) and the output of classifier \(j\) is \(b\) (with output 1 indicating a correct and 0 an incorrect prediction). In this study, KP is adapted to ECOC using exactly the same logic. As with REP, SVM is used as the base classifier for ECOC and cross-validation is applied as in Sect. 3.1. The method is tested on ten different random folds, each fold having its own ensemble.
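The sketch below illustrates KP with the \(\kappa \) measure quoted above, computed from 0/1 correctness indicators of the base classifiers on \(Z_{training}\); the way the mean pairwise \(\kappa \) of the growing subensemble is evaluated is our own simplification.

```python
import numpy as np

def pairwise_kappa(correct_i, correct_j):
    """Pairwise diversity as defined above: correct_* are 0/1 vectors indicating
    whether each classifier is correct on each training point."""
    n00 = np.sum((correct_i == 0) & (correct_j == 0))
    return n00 / len(correct_i)          # denominator N^11+N^10+N^01+N^00 equals N

def kappa_pruning(correct, target_size):
    """Greedy KP: correct is an (N, T) 0/1 matrix of per-classifier correctness.
    Start from the most diverse pair and keep adding the classifier that keeps the
    mean pairwise kappa with the current subensemble as small as possible."""
    N, T = correct.shape
    K = np.array([[pairwise_kappa(correct[:, i], correct[:, j]) for j in range(T)]
                  for i in range(T)])
    off = K + np.diag([np.inf] * T)       # ignore the diagonal when picking the first pair
    i, j = np.unravel_index(np.argmin(off), off.shape)
    selected = [int(i), int(j)]
    while len(selected) < target_size:
        remaining = [t for t in range(T) if t not in selected]
        mean_kappa = [np.mean(K[selected, t]) for t in remaining]
        selected.append(remaining[int(np.argmin(mean_kappa))])
    return selected
```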

3.3 Pruning of ECOC by using optimization model

It is known that, as the number of classes increases, the number of base classifiers in the exhaustive ECOC matrix also increases. Recall that if \(k\) is the number of classes, the number of columns of the exhaustive ECOC matrix can be at most \(\frac{2^{k}}{2} -1\). As \(k\) increases, the number of base classifiers in the ensemble grows exponentially, so the running time and the efficiency of the algorithm deteriorate. In this study we propose a pruning method to select the most accurate and diverse classifiers from exhaustive ECOC coding by incorporating the diversity measure of Zhang et al. (2006). In order to obtain a good mathematical formulation of the trade-off, the error structure of the problem is represented by a linear combination of the base classifiers' errors and diversities, following the error analysis in Zhang et al. (2006). It is claimed there that if strength and diversity measurements for a classification ensemble can be found, a linear combination of them should serve as a good approximation of the overall ensemble error. Minimizing this approximate ensemble error function is the objective of the mathematical program. In Zhang et al. (2006), the error of each base classifier on the training set is recorded in the matrix \(P\) as follows:

  • \(P_{i,j} = 0\), if the jth classifier is correct on data point \(i\),

  • \(P_{i,j} = 1\), otherwise.

A matrix \(G\), the error matrix, is defined by \(G= P^TP\): the diagonal term \(G_{ii}\) is the total number of errors made by classifier \(i\), and the off-diagonal term \(G_{ij}\) is the number of common errors made by classifiers \(i\) and \(j\), so that the off-diagonal terms correspond to a measure of diversity. The matrix \(G\) is then normalized to put all the elements on the same scale as

$$\begin{aligned} \tilde{G}_{ii} = \frac{G_{ii}}{N}, \end{aligned}$$

where \(N\) is the number of training points and

$$\begin{aligned} \tilde{G}_{ij, i\ne j} = \frac{1}{2}\left( \frac{G_{ij}}{G_{ii}} + \frac{G_{ij}}{G_{jj}} \right) . \end{aligned}$$
(1)

We note that pruning the ECOC matrix in this study differs from simply choosing the best columns of the ECOC matrix, since pruning here is based on both the observed error rates and the diversity measures on the training set. As discussed above, \(\tilde{G}_{ii}\) represents the total error made by classifier \(i\) and \(\tilde{G}_{ij}\) measures the common errors made by the pair of classifiers \(i\) and \(j\). Note that the new matrix \(\tilde{G}\) is made symmetric by taking the average of \(\frac{G_{ij}}{G_{ii}}\) and \(\frac{G_{ij}}{G_{jj}}\). Furthermore, \(\sum _i{\tilde{G}_{ii}}\) measures the overall strength of the ensemble and \(\sum _{ij,i\ne j}{\tilde{G}_{ij}}\) measures the diversity. The mathematical program is formulated as a quadratic integer problem in Zhang et al. (2006), where a subset of classifiers of fixed size is enforced by a constraint in the following problem

$$\begin{aligned} \begin{array}{r} {\mathop {\mathrm{min}}\limits _{x}}\, x^{T}\tilde{G}x \\ \hbox {subject to } {\mathop {\sum }\limits _{i}}{x_i} = k, \\ x_{i} \in \{0,1\}. \end{array} \end{aligned}$$
(2)

Here, \(x_i\) indicates whether the ith classifier is in the ensemble or not: if \(x_{i}=1\), the ith classifier is chosen to be in the ensemble, and if \(x_i=0\) it is not considered. The constant \(k\) is the size of the ensemble and must be chosen as a parameter beforehand. Problem (2) is NP-hard in general, but it is approximated as a max-cut problem in Zhang et al. (2006) and solved by semi-definite programming (SDP). The overall efficiency of the method depends on the ensemble size \(k\), since it is a parameter given before solving the problem and the solution changes accordingly.
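As an illustration of the error analysis above, the following sketch builds \(P\) and the normalized matrix \(\tilde{G}\) of Eq. (1) from a 0/1 correctness matrix on the training set; the small constant guarding against division by zero for an error-free classifier is our own assumption.

```python
import numpy as np

def error_matrix(correct):
    """correct: (N, T) 0/1 matrix, correct[i, j] = 1 if classifier j is right on
    training point i. Returns the normalized matrix G~ of Eq. (1)."""
    P = 1.0 - np.asarray(correct, dtype=float)     # P[i, j] = 1 on an error
    N, T = P.shape
    G = P.T @ P                                    # diag: total errors; off-diag: common errors
    eps = 1e-12                                    # assumption: avoid division by zero
    G_tilde = np.zeros_like(G)
    for i in range(T):
        for j in range(T):
            if i == j:
                G_tilde[i, i] = G[i, i] / N
            else:
                G_tilde[i, j] = 0.5 * (G[i, j] / (G[i, i] + eps)
                                       + G[i, j] / (G[j, j] + eps))
    return G_tilde
```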

3.3.1 Unparametrized pruning (UP)

In this study, we remove the ensemble size parameter simply by adding a penalization term to the objective function with a regularization constant \(\rho \). Note that the constraint in Eq. (2) can be written as \(\left\| x\right\| _{0} = k\), where the zero norm is defined as the number of non-zero elements, which induces sparsity in the model. Instead of fixing the pruning rate \(k\) in (2), the indices of the non-zero entries of \(x\) determine the pruning rate, i.e., the number of classifiers in the subensemble. Furthermore, with the help of sparsity we relax the binary vector to a real vector, i.e., \(x\in \mathbf{R}^n\). Equation (2) then becomes the following unconstrained problem, which is the regularized version of problem (2).

$$\begin{aligned} \min _{x\in R^n} x^{T}\tilde{G}x + \rho \left\| x\right\| _{0} \end{aligned}$$
(3)

The first step in solving the continuous optimization problem (3) is to approximate the cardinality term \(\left\| x\right\| _{0}\). One can approximate it by \(\left\| x\right\| _{1}\), which is the usual heuristic. We instead approximate it via the negative log-likelihood of a Student t-distribution, which is a tighter approximation than \(\left\| x\right\| _{1}\) and has been used in many different contexts (Candes et al. 2007; Fazel et al. 2003; Sriperumbudur et al. 2011; Weston et al. 2003). There are other approximations to \(\left\| x\right\| _{0}\), e.g., \(\sum _{i=1}^{n}{(1- e^{-\alpha \left| x_i\right| })}\) where \(\alpha >0\) (Bradley and Mangasarian 1998).

If we define the approximation of the zero norm as

$$\begin{aligned} \left\| x \right\| _0 := \sum _{i=1}^{n} {1_{x_i\ne 0}} = \lim _{\epsilon \rightarrow 0} \sum _{i=1}^{n}\frac{\log {\left( 1+\left| x_i\right| /\epsilon \right) }}{\log {\left( 1+1/\epsilon \right) }}, \end{aligned}$$

then the problem (3) becomes

$$\begin{aligned} \min _{x\in R^n} x^{T}\tilde{G}x + \rho \lim _{\epsilon \rightarrow 0} \sum _{i=1}^{n}\frac{\log {\left( 1+\left| x_i\right| /\epsilon \right) }}{\log {\left( 1+1/\epsilon \right) }} \end{aligned}$$
(4)
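The smoothed surrogate in (4) with a fixed \(\epsilon >0\) can be written as a one-line function; the value of \(\epsilon \) below is an illustrative choice, not one prescribed by the paper.

```python
import numpy as np

def zero_norm_surrogate(x, eps=1e-2):
    """Smooth approximation of ||x||_0 used in (4), with a fixed eps > 0."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.log1p(np.abs(x) / eps)) / np.log1p(1.0 / eps)

# For entries well above eps the surrogate is close to the true count of non-zeros:
# zero_norm_surrogate([0.0, 0.9, 0.0, 1.1])  ->  approximately 2
```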

Let us denote the set of symmetric matrices by \(\mathbf{S}^{n}\), and the sets of positive semidefinite and positive definite matrices by \(\mathbf{S}^{+}\) and \(\mathbf{S}^{++}\), respectively. Observe that the matrix \(\tilde{G}\) is symmetric, since \(\tilde{G}_{ij} = \tilde{G}_{ji}\) and the indices \(ij\) and \(ji\) refer to the common errors between the \(i\)th and \(j\)th classifiers. Note that problem (4) is a nonconvex unconstrained problem, since in general \(\tilde{G}\notin \mathbf{S}^{+}\) or \(\mathbf{S}^{++}\). Hence we cannot use the well-known property of convex optimization [see the theorem below (Nocedal and Wright 2000)], which states the following.

Theorem 1

(Nocedal and Wright 2000) Let \(x^*\) be a feasible point of the problem

$$\begin{aligned} \min _{x} f(x), \end{aligned}$$
(5)

where \(f(x)\) is a convex differentiable function. If \(\nabla f(x^*)=0\), then \(x^*\) is the global minimum of (5).

We model problem (4) by a difference of convex (DC) functions, since the objective function has a DC structure. Note that formulation (4) is a continuous optimization problem, unlike problem (2), which has a combinatorial term. Before giving the DC formulation of the problem, let us define a DC program as follows:

Definition 1

Let \(\Omega \) be a convex set in \(R^n\) and \(f: \Omega \rightarrow R\) a real valued function. Then \(f\) is a DC function on \(\Omega \) if there exist two convex functions \(g,h: \Omega \rightarrow R\) such that \(f(x) =g(x)-h(x), \ x\in \Omega \). Optimization problems of the form

$$\begin{aligned}&\min _{x\in \Omega } f_0(x) \end{aligned}$$
(6)
$$\begin{aligned}&\hbox {s.t.} f_i(x)\le 0, \,i=1,\ldots ,m \end{aligned}$$
(7)

where \(f_i(x) = g_i(x)-h_i(x), \ i= 0,1,\ldots ,m\), are called DC programs.

In order to formulate (4) as a DC program, let us choose \(\tau \in R\) such that \(\tilde{G} + \tau I \in S^{+}\). If \(\tilde{G}\in S^{+}\), such a \(\tau \) exists trivially: choose any \(\tau >0\). If \(\tilde{G}\) is indefinite, choosing \(\tau >-\lambda _{min}(\tilde{G})\), where \(\lambda _{min}\) is the smallest eigenvalue of \(\tilde{G}\), ensures that \(\tilde{G}+\tau I \in S^+\). A similar shift is used in other contexts, for example when solving the generalized eigenvalue problem in Sriperumbudur et al. (2011). Therefore, if we choose \(\tau > \max (0,-\lambda _{min})\), we obtain a positive semidefinite matrix for any \(\tilde{G}\in S^n\). Problem (4) can then be written as

$$\begin{aligned} \min _{x} x^{T}\left( \tilde{G}+\tau I\right) x - \tau \left\| x\right\| ^2_2 + \rho \lim _{\epsilon \rightarrow 0} \sum _{i=1}^{n}\frac{\log {\left( 1+\left| x_i\right| /\epsilon \right) }}{\log {\left( 1+1/\epsilon \right) }}, \end{aligned}$$

where \(\left\| \cdot \right\| _2\) denotes the Euclidean norm. The above problem can be approximated further by dropping the limit and fixing a small \(\epsilon >0\). Hence the following unconstrained DC problem is obtained

$$\begin{aligned} \min _{x} x^{T}\left( \tilde{G}+\tau I\right) x - \tau \left\| x\right\| ^2_2 + \rho \sum _{i=1}^{n}\frac{\log {\left( 1+\left| x_i\right| /\epsilon \right) }}{\log {\left( 1+1/\epsilon \right) }}. \end{aligned}$$
(8)

With this new formulation, the original model of Zhang et al. (2006) is made independent of the subensemble size \(k\) by penalizing the constraint in model (2) with the regularization constant \(\rho \). The subensemble size \(k\) affects the accuracy of the model, since a subensemble with a small \(k\) may lack important classifiers, while a very large \(k\) may include redundant classifiers. Testing all values of \(k\) and choosing the best one by exhaustive search on the training set would make the algorithm too slow, especially as the number of classes increases. In the new formulation proposed in model (8), two new parameters \(\tau \) and \(\rho \) are introduced: the first is determined from the minimum eigenvalue of the matrix \(\tilde{G}\) and the latter by the well-known statistical method of cross-validation. Here we choose the number of folds as 5. The proposed method is given in Algorithm 1 and is referred to as "ECOC with UP".

Algorithm 1 ECOC with UP
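The paper solves (8) with MATLAB's fminunc; the sketch below is an approximate Python/SciPy rendering of the same pipeline, in which \(\tau \) is obtained from the smallest eigenvalue of \(\tilde{G}\), the smoothed objective is minimized from a few random starting points, and the solution is thresholded at 0.5 as in Step 5 of Algorithm 1. The solver choice, \(\epsilon \), the number of restarts and the multi-start heuristic are our assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def up_prune(G_tilde, rho, eps=1e-2, n_starts=5, threshold=0.5, seed=0):
    """Hedged sketch of unparametrized pruning (UP): minimize objective (8) with a
    general-purpose smooth solver and threshold the solution at 0.5."""
    T = G_tilde.shape[0]
    lam_min = np.linalg.eigvalsh(G_tilde).min()
    tau = max(0.0, -lam_min) + 1e-6                 # ensures G~ + tau*I is positive semidefinite

    def objective(x):
        quad = x @ (G_tilde + tau * np.eye(T)) @ x - tau * np.dot(x, x)
        surrogate = np.sum(np.log1p(np.abs(x) / eps)) / np.log1p(1.0 / eps)
        return quad + rho * surrogate

    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(0.0, 1.0, size=T)          # random start in [0, 1]^T
        res = minimize(objective, x0, method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    selected = np.flatnonzero(best.x > threshold)   # classifiers kept in the subensemble
    return selected, best.x
```

The indices returned in `selected` then index the columns of the ECOC matrix that form the pruned code, and their number plays the role of the subensemble size \(k\) in the comparisons of Sect. 4.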

4 Experiments

We implemented our novel pruning method with the MATLAB fminunc function from the Optimization Toolbox, which solves the unconstrained optimization problem (Branch and Grace 2002). The new parameters \(\tau \) and \(\rho \) introduced in Eq. (8) are calculated from the minimum eigenvalue \(\lambda _{min}(\tilde{G})\) and by 5-fold cross-validation, respectively. First we derived the matrix \(\tilde{G}\) from the standard ECOC algorithm on the training set and then performed pruning by solving the optimization problem (8). It has been shown theoretically and experimentally that randomly generated long or equi-distant code matrices give close to optimum performance when used with strong base classifiers (James and Hastie 1997; Windeatt and Ghaderi 2003). Thus, in this study the coding of the ECOC matrix is performed by random generation such that each column of the ECOC matrix has an approximately equal number of \(-1\)s and \(+1\)s. The solution set of problem (8) constitutes the reduced set of base learners for ECOC. We used the UCI machine learning repository data sets ecoli, glass, dermatology, yeast and wine, and facial expression classification data. As base learner we used SVM with Gaussian kernels. The kernel parameters and the regularisation constant of the SVM are chosen by 10-fold cross-validation. In Zhang et al. (2006), the pruning size is set beforehand, whereas in our formulation it is part of the optimisation (8). Since the solution of problem (8) consists of continuous values, we round the solution to binary values with the well-known rounding heuristic from combinatorial optimization. From the experiments we observe that the solution converges if a threshold is placed on the real output \(x\) of (8). The threshold is taken as \(x>0.5\), as indicated in Step 5 of Algorithm 1, where the ith classifier is included in the subensemble if \(x_i>0.5\).

In Tables 1, 2 and 3, the error rates and running times in seconds (given in parentheses) of the pruning methods REP and KP, ACEC of Zor et al. (2014) and SDP of Zhang et al. (2006), all adapted to ECOC, are compared, and random guessing results are also reported. The highest accuracy is shown in bold in Tables 1, 2 and 3. For statistical significance, a Student's t test is performed to assess whether the error rate lies in the confidence interval over the 10 test folds, and it is found that the error rates reported in Tables 1, 2 and 3 are within the confidence intervals, referred to as CI. The statistical significance is found to be \(p>0.99\) with a confidence level of \(95\,\%\) and \(H=0\), which indicates that we should reject the null hypothesis. Our null hypothesis is "Error rates are random for 10 test folds". The subensemble size \(k\) is determined by our pruning method in Step 5 of Algorithm 1, where \(k\) is the number of non-zero elements of the vector \(x\) in Eq. (4), and it is fixed to the same value for the other pruning methods in order to make a meaningful comparison. The method SDP, on which our algorithm is based, is parametric in the subensemble size \(k\), and the error rate depends strongly on \(k\). If \(k\) is too small it may give a high error rate, and if \(k\) is too large the subensemble may contain redundant classifiers, which also affects the error rate; it is therefore important to determine the best \(k\). Finding \(k\) heuristically would be time consuming, since it requires trying all values \(k=1,2,\ldots ,n\). Thus the comparison of the error rates of "ECOC with UP" with SDP is not strictly fair, since the \(k\) used in SDP is the one found beforehand by "ECOC with UP". Likewise, the comparison with ACEC is not strictly fair since it is an ordered aggregation method, and the CPU time of ACEC is higher than that of all the methods compared in this study. Thus, for large data sets ACEC is not an efficient method although it gives higher accuracy.

Table 1 Mean error values and running times of 10 fold cross validation with 50 base classifiers, here k is the ensemble size after pruning
Table 2 Mean error values and running times of 10 fold cross validation with 100 base classifiers, here k is the ensemble size after pruning
Table 3 Mean error values and running times of 10 fold cross validation with 150 base classifiers, here k is the ensemble size after pruning

On the other hand, in our approach "ECOC with UP" \(k\) is replaced by the new regularization parameter \(\rho \) introduced in the penalization term, which is determined by cross-validation over the values \(\rho = [ 1 \quad 10 \quad 100 \quad 500 \quad 1000]\) on the training set. Furthermore, it is important to observe that the running time of the cross-validation used to determine \(\rho \) is less than that of a heuristic search for \(k\).
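A possible implementation of this model-selection step is sketched below; the helper `pruned_ecoc_error` is hypothetical and stands for training the ECOC base classifiers on the training folds, pruning them with UP for a given \(\rho \), and returning the validation error rate.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_rho(X, y, pruned_ecoc_error, rho_grid=(1, 10, 100, 500, 1000),
               n_folds=5, seed=0):
    """Pick the regularization constant rho by 5-fold cross-validation on the
    training set. `pruned_ecoc_error(rho, X_tr, y_tr, X_val, y_val)` is a
    hypothetical helper returning the validation error of the pruned ensemble."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    mean_errors = []
    for rho in rho_grid:
        errs = [pruned_ecoc_error(rho, X[tr], y[tr], X[val], y[val])
                for tr, val in kf.split(X)]
        mean_errors.append(np.mean(errs))
    return rho_grid[int(np.argmin(mean_errors))]
```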

We tested our algorithm on pools of classifiers of different sizes, i.e., ECOC matrices with 50, 100 and 150 base classifiers, on 5 data sets from the UCI machine learning repository (Blake and Merz 1998) and on the Facial Expression Classification (FAC) data (Kanade et al. 2000), which is explained in Sect. 4.1. These results show that ACEC performs better in terms of error rate but is very slow since it is based on ordered aggregation. If we compare our approach "ECOC with UP" with the other methods except ACEC, it performs better than the other pruning methods in most cases. Note that SDP and ACEC are not applicable to large scale data, as can be seen for the facial expression data "FAC" in each table. Observe that whenever another pruning method outperforms "ECOC with UP", it always comes with a much longer running time. For instance, in Table 1 for the FAC data, Full ECOC and Random Guessing give a better error rate, but their running time is greater than that of "ECOC with UP"; KP is also better than our method, but its running time is significantly higher. In Table 3, for the wine data, even though REP is significantly better than all other methods, its running time is twice that of the "ECOC with UP" method, which has the second best error rate. Likewise, for the glass data in Table 3, "ECOC with UP" has the second best result with a lower running time than Full ECOC, REP and KP. In particular, REP and KP can be very time consuming because of the combinatorial structure of the algorithms, even though they give better results in some cases, e.g., in Table 3. As explained in Sects. 3.1 and 3.2, both algorithms evaluate many candidate combinations to find the minimum ensemble error, which makes them very slow. It should also be noted that the pruning size must be given as an input for all of the methods compared in this section.

4.1 Facial expression classification

Automatic facial expression recognition has applications in areas such as human–computer interaction, human emotion analysis, biometric authentication and fatigue detection. However, the emotions characterised in these different applications can be quite varied. While it is possible to learn emotions directly from features derived from raw images, an attractive and more generic approach is to decompose the face into discrete facial features. A commonly used method is the facial-action coding system (FACS) (Ekman and Friesen 1978; Tian et al. 2001), in which facial features are characterised as one of 44 types known as action units (AUs). The advantage of FACS is that facial expressions of interest (e.g. emotions or fatigue) can be learned by looking for particular groups of AUs so that the interpretation can be de-coupled from their detection. Figure 2 shows example AUs from the eye region.

Fig. 2 Some example AUs and AU groups from the region around the eyes. AU1 inner brow raised, AU2 outer brow raised, AU4 brows lowered and drawn together, AU5 upper eyelids raised, AU6 cheeks raised, AU7 lower eyelids raised. The images are shown after manual eye location, cropping, scaling and histogram equalisation

It is possible to define a series of two-class problems, in which a classifier is trained to differentiate each AU from all other AUs (one versus rest). However, the presence of one AU may affect the strengths of other AUs, in other words not all AUs are linearly independent. In this paper, AU detection is posed as a single multiclass problem, in which groups of AUs are assigned a single class. Therefore a single AU may appear in more than one group.

In order to detect AUs, the images first need to be pre-processed. Multi-resolution local binary patterns (MLBP) are used for feature extraction (Smith and Windeatt 2010; Raja and Gong 2006), and the fast correlation-based filter (FCBF) algorithm (Yu and Liu 2004) is employed for feature selection. Further details of the pre-processing and normalisation procedures may be found in Smith and Windeatt (2011).

Results are presented for the Cohn-Kanade face expression database (Kanade et al. 2000), which contains frontal video clips of posed expression sequences from 97 university students. The last image of each sequence has ground truth available in the form of a manual AU coding by human experts. We focused on detecting AUs from the upper face region, as shown in Fig. 2. In order to avoid defining classes with very few training patterns, AU groups with three or fewer examples were ignored. This left 456 images available for training, distributed across the 12 classes shown in Table 4.

Table 4 Classes of action unit groups used in the experiments

We applied the same procedure for ECOC and pruning as described in Sect. 3. ECOC is performed with 200 base classifiers, for which we used SVM with a Gaussian kernel (Chang and Lin 2011). Each run was based on a different randomly chosen stratified training set with a 90/10 training/test split. The ECOC code matrices were randomly generated with balanced numbers of +1s and -1s in each column, as proposed by Schapire (1997). Experimental results for the FAC data with the proposed pruning algorithm are compared in Tables 1, 2 and 3. They show that an appropriate subset of base learners gives approximately the same error rate, so that fewer base learners lead to less training time, which is proportional to the number of base learners.

5 Discussion and conclusion

In this study, we proposed a faster and more efficient pruning method for ECOC that optimizes accuracy and diversity simultaneously in the proposed cost function. Our new algorithm prunes the set of base classifiers by solving a continuous optimization problem, unlike Zhang et al. (2006). One of the important aspects of the proposed method is that the size of the pruned set comes directly from the optimization problem. The unconstrained optimization formulation given in Eq. (4) does not need the ensemble size and finds the optimum subensemble on the training set. In Zhang et al. (2006) and Zor et al. (2014), the pre-defined pruning size determines the error rate, while here it is determined by searching for the optimality conditions. For larger numbers of classes and base classifiers, pruning will lead to a more efficient solution for multiclass classification. ECOC matrices of different sizes were tested for five different pruning methods and for the full ECOC matrix without pruning. For most of the data sets, ECOC with UP reduces the error with a smaller running time, as shown in Tables 1, 2 and 3. It should be clarified that we used the Hamming distance for decoding and, as future work, we will apply the other decoding methods mentioned in Sect. 2.