
1 Introduction

The multi-class classification problem has attracted a great deal of attention in the machine learning field. Traditional solutions tend to transform it into multiple binary problems; the corresponding strategies include decision trees, neural networks, and so on.

Error Correcting Output Codes (ECOC) [1, 2] is a widely used method among these strategies, originally proposed by Dietterich and Bakiri [3]. It usually involves two parts: encoding and decoding. The encoding part generates a sequence of bits, i.e., a code word, for each class; all code words together form a coding matrix. The decoding part predicts class labels for unseen data by comparing their output code words with the code words of the classes in the coding matrix, according to specific strategies such as Hamming decoding (HD) [4] and Euclidean decoding [5]. In this paper, we focus on the encoding part.
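Although the remainder of the paper concerns encoding, the decoding step can be made concrete with a minimal sketch. The snippet below is our own illustration (function name and the simple handling of zero entries are assumptions, not the original implementations of [4, 5]): it assigns a sample to the class whose code word is closest in Hamming distance to the vector of dichotomizer outputs.

```python
import numpy as np

def hamming_decode(outputs, M):
    # Hamming decoding: pick the row (class) of coding matrix M whose code word
    # disagrees with the dichotomizer outputs on the fewest bits.
    # `outputs` is a vector in {-1, +1}; zero entries of M contribute half a
    # disagreement, one common convention for ternary codes.
    disagree = 0.5 * (1.0 - M * outputs)   # 0 if signs agree, 1 if they differ, 0.5 for zeros
    return int(np.argmin(disagree.sum(axis=1)))
```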

The goal of encoding is to design a coding matrix M. Each row of M represents one class and each column of M defines one binary problem (dichotomizer). For each class, encoding creates a corresponding code word in which each bit is the prediction target of a dichotomizer. Traditionally, the coding matrix is coded with +1 and −1. In Table 1(a), +1 means that the corresponding dichotomizer takes this class as the positive class and −1 means it takes this class as the negative class. However, the length of the code words in this binary scenario is fixed, so only a limited number of dichotomizers can be used, which restricts the performance of ECOC to some extent. Allwein et al. [6] further presented a ternary coding matrix that allows some bits of the coding matrix to be zero, as shown in Table 1(b). The zero symbol denotes that the corresponding class does not participate in that specific binary problem. Owing to the zero symbols, the ternary coding matrix is more flexible and can have much longer code words than the binary one.

Table 1. Coding matrix for a 4-class problem

Consequently, the core task in encoding boils down to building an appropriate coding matrix. The simplest strategy is one-versus-all (OVA) [4], which takes one class as the positive class and all the others as the negative class to build a binary coding matrix. One-versus-one (OVO) [5] forms a ternary coding matrix in which each column takes only two classes as the positive and negative classes respectively, while the remaining classes are represented by zero symbols. Random codes [6] generate the coding matrix randomly; the binary variant is called dense random and the ternary one is termed sparse random. Though these traditional strategies are simple, they are all data-independent. As a result, they either perform poorly or require overly long code words, which means more dichotomizers and higher computational costs.
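As a concrete illustration, the sketch below (our own, using numpy; the function names are ours) builds the OVA and OVO coding matrices for a k-class problem, with rows indexing classes and columns indexing dichotomizers.

```python
import numpy as np

def ova_matrix(k):
    # One-versus-all: k columns, each marking one class +1 and all the others -1.
    return 2 * np.eye(k, dtype=int) - 1

def ovo_matrix(k):
    # One-versus-one: k(k-1)/2 ternary columns, one per pair of classes;
    # classes not involved in a pair get a zero symbol.
    cols = []
    for i in range(k):
        for j in range(i + 1, k):
            col = np.zeros(k, dtype=int)
            col[i], col[j] = 1, -1
            cols.append(col)
    return np.stack(cols, axis=1)
```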

Recently proposed tree-form encoding algorithms [7–10] utilize certain criteria to estimate inter-class separability, build a tree accordingly, and obtain a data-dependent coding matrix. Discriminant ECOC (DECOC) [8] applies the sequential forward floating search (SFFS) to generate the tree from top to bottom by heuristically maximizing the mutual information (MI) [11] between the data and the class labels. A ternary coding matrix is then constructed according to the hierarchical partition of the tree. Based on DECOC, subclass ECOC (SECOC) [9] further uses a clustering method to create subclasses when the original classification problem is linearly non-separable. However, the MI criterion used in DECOC and SECOC is computed by a non-parametric method, which generally requires a large number of samples and can lead to unstable results. Hierarchical ECOC (HECOC) [10] utilizes support vector domain description (SVDD) [12] as the criterion to estimate inter-class separability, which is more stable than DECOC and SECOC. However, when building the tree, HECOC chooses the two classes with the smallest inter-class separability as a node. As a result, the base dichotomizers face relatively difficult binary classification problems, which limits the performance of ECOC to some degree.

In this paper, we propose a novel encoding method termed maximum margin tree ECOC (M\(^2\)ECOC). M\(^2\)ECOC estimates inter-class separability by the maximum margins between classes rather than by the MI criterion. Consequently, the corresponding coding matrix is more stable and more discriminative for the subsequent classification. Concretely, M\(^2\)ECOC uses the support vector machine (SVM) [13] to compute the maximum margins between classes and obtains a maximum margin matrix. Based on this matrix, M\(^2\)ECOC builds a binary tree bottom-up by repeatedly choosing the maximal maximum margin. Finally, the maximum margin tree is converted into a ternary coding matrix according to the hierarchical partition of the tree.

The paper is organized as follows. Section 2 introduces the tree-form encoding algorithms DECOC and SECOC. Section 3 describes our M\(^2\)ECOC algorithm in detail. In Sect. 4, comparative experiments with several state-of-the-art encoding algorithms are presented. Finally, the last section concludes the paper.

2 Tree-Form Encoding Algorithms

2.1 Discriminant ECOC (DECOC)

DECOC [8] applies SFFS to find the hierarchical partition of the tree and builds the tree from top to bottom. As shown in Fig. 1(a), DECOC separates the original class set \(\{C _1,C _2,C _3\}\) into two partitions \(\{C _1,C _3\}\) and \(\{C _2\}\), and repeats this process until each partition contains only one class. SFFS is a suboptimal sequential search method that dynamically changes the number of forward steps until the resulting subsets are better than the previously evaluated ones according to some criterion [8]. MI, an often-used metric in information theory for measuring the dependence between two random variables, is selected to evaluate the discriminability of the class sets in DECOC. MI is defined as follows:

$$\begin{aligned} I ({{\varvec{x}}},{{\varvec{y}}})=\int \int p (x ,y )\log (\frac{p (x ,y )}{p (x )p (y )})\,dx \,dy \end{aligned}$$
(1)

where x denotes a sample in the class sets and y denotes the class label; p(x) and p(y) are their probability density functions respectively. DECOC maximizes the MI value between the data in the class sets and the class labels in order to maximize the discriminability of the class sets. However, when computing the MI value, DECOC uses a non-parametric Parzen estimation method, which usually requires a large number of samples to reach a relatively good estimate and is therefore more likely to lead to unstable experimental results.
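For reference, the following is a minimal sketch of one way to form such a non-parametric MI estimate with Parzen (kernel density) estimates, assuming scikit-learn's KernelDensity; it is our own simplified Monte-Carlo plug-in estimator, not DECOC's exact implementation, and the bandwidth is an assumed free parameter.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def parzen_mi(X, y, bandwidth=0.5):
    # Plug-in estimate of I(x; y) for continuous x and discrete labels y:
    # I = sum_c p(c) * E_{x ~ p(x|c)} [ log p(x|c) - log p(x) ],
    # with both densities estimated by Parzen windows (Gaussian KDE).
    log_px = KernelDensity(bandwidth=bandwidth).fit(X).score_samples(X)
    mi = 0.0
    for c in np.unique(y):
        mask = (y == c)
        log_pxc = KernelDensity(bandwidth=bandwidth).fit(X[mask]).score_samples(X[mask])
        mi += mask.mean() * np.mean(log_pxc - log_px[mask])
    return mi
```

With only a few samples per class, both density estimates become noisy, which is exactly the source of instability pointed out above.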

Once the tree has been completely constructed, DECOC fills a ternary coding matrix based on the tree. In particular, for each partition the classes in the left part are represented by +1 and the classes in the right part by −1, while classes that do not take part in that partition are represented by 0.

2.2 Subclass ECOC (SECOC)

On the basis of DECOC, SECOC [9] aims to solve linearly non-separable problems by dividing an original class into new subclasses. SECOC also uses SFFS to find the hierarchical partition of the tree by maximizing the MI value. When the resulting partition is linearly non-separable, SECOC further uses the clustering method K-means to split it into simpler and smaller sub-partitions. Usually, the number of sub-partitions is set to 2 [9]. As shown in Fig. 1(b), SECOC splits the original linearly non-separable problem \(\{C_1, C _3\}\) into two linearly separable problems \(\{C_1, C_{3\_1}\}\) and \(\{C_1, C _{3\_2}\}\) by dividing class \(C_3\) into two new subclasses \(C_{3\_1}\) and \(C_{3\_2}\). Therefore, if the original classification problem is linearly non-separable, SECOC can transform it into a linearly separable one through several decompositions. A minimal sketch of this subclass-splitting step is given after Fig. 1.

Fig. 1. Illustration of the trees in DECOC and SECOC
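The sketch below illustrates the subclass-splitting step, assuming scikit-learn's KMeans; the function name and the choice of two clusters are our own, following the setting reported in [9].

```python
import numpy as np
from sklearn.cluster import KMeans

def split_class_into_subclasses(X_c, n_sub=2, seed=0):
    # Split the samples of one class into n_sub subclasses with k-means,
    # as SECOC does when a bipartition is not linearly separable.
    labels = KMeans(n_clusters=n_sub, n_init=10, random_state=seed).fit_predict(X_c)
    return [X_c[labels == s] for s in range(n_sub)]
```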

3 Maximum Margin Tree ECOC (M\(^2\)ECOC)

3.1 Maximum Margin

Margin, defined as the minimum distance between the decision boundary and the samples, is one of the best-known concepts in the SVM proposed by Vapnik [13].

Specifically, the decision boundary, also called the decision hyperplane, is denoted as follows:

$$\begin{aligned} {{\varvec{w}}}^T\varPhi ({{\varvec{x}}})+b =0 \end{aligned}$$
(2)

where \(\varPhi ({{\varvec{x}}})\) is a fixed feature-space transformation function, w is a weight vector and b is the bias. The functional margin can be formulated as:

$$\begin{aligned} \hat{r}=\min \limits _i\{y _i({{\varvec{w}}}^T\varPhi ({{\varvec{x}}}_i)+b )\} \quad i=1,2,...,N \end{aligned}$$
(3)

where \(y _i\in {{\varvec{y}}}\) denotes the corresponding class label and N is the number of samples. However, the functional margin is not invariant to the scaling of w and b. We therefore obtain the geometric margin by normalizing (3):

$$\begin{aligned} \tilde{r}=\min \limits _i\{y _i(\frac{{{\varvec{w}}}^T}{||{{\varvec{w}}}||}\varPhi ({{\varvec{x}}}_i)+\frac{b }{||{{\varvec{w}}}||})\} \quad i=1,2,...,N \end{aligned}$$
(4)

According to (3) and (4), we can easily obtain the relationship between the functional margin and the geometric margin as follows:

$$\begin{aligned} \tilde{r}=\frac{\hat{r}}{||{{\varvec{w}}}||} \end{aligned}$$
(5)

Setting \(\hat{r}\) to 1, the maximum margin can be obtained by solving the following problem:

$$\begin{aligned}&\max \limits _{{{\varvec{w}}},b }\quad \frac{1}{||{{\varvec{w}}}||}\\&\,\,\, s.t.\quad y _i({{\varvec{w}}}^T\varPhi ({{\varvec{x}}}_i)+b )-1\geqslant 0, \quad i=1,2,...,N \nonumber \end{aligned}$$
(6)

It is obvious that maximizing \(||{{\varvec{w}}}||^{-1}\) is equivalent to minimizing \({||{{\varvec{w}}}||}\). Introducing slack variables to tolerate misclassified samples (the soft margin), we can transform (6) into the following optimization problem:

$$\begin{aligned}&\min \limits _{{{\varvec{w}}},b }\quad \frac{1}{2}{||{{\varvec{w}}}||}^2+C \sum _{i=1}^N \xi _i\\&\,\,\, s.t.\quad y _i({{\varvec{w}}}^T\varPhi ({{\varvec{x}}}_i)+b )\geqslant 1-\xi _i, \quad i=1,2,...,N \nonumber \\&\quad \quad \quad \xi _i\geqslant 0, \quad i=1,2,...,N \nonumber \end{aligned}$$
(7)

where \(\varvec{\xi }\) denotes the slack variables and C balances the slack penalty against the margin. Problem (7) can be further converted into a dual problem using Lagrange multipliers together with kernel functions:

$$\begin{aligned}&\max \limits _{\varvec{\alpha }} \quad \sum _{i=1}^N\alpha _i-\frac{1}{2}\sum _{i,j=1}^N\alpha _i\alpha _jy _iy _jK({{\varvec{x}}}_i,{{\varvec{x}}}_j)\\&\,\,\, s.t.\quad \sum _{i=1}^N \alpha _iy _i=0, \quad i=1,2,...,N\nonumber \\&\quad \quad \quad 0\leqslant \alpha _i\leqslant C , \quad i=1,2,...,N\nonumber \end{aligned}$$
(8)

By solving (8) we obtain \(\varvec{\alpha }\). Consequently, the maximum margin can be computed as follows [14]:

$$\begin{aligned} margin=\frac{1}{||{{\varvec{w}}}||} \end{aligned}$$
(9)

where the vector w is determined by

$$\begin{aligned} {{\varvec{w}}}=\sum _{i=1}^N\alpha _iy _i\varPhi ({{\varvec{x}}}_i) \end{aligned}$$
(10)
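Since \({{\varvec{w}}}\) lives in the (possibly implicit) feature space, \(||{{\varvec{w}}}||^2=\sum _{i,j}\alpha _i\alpha _jy _iy _jK({{\varvec{x}}}_i,{{\varvec{x}}}_j)\) can be evaluated with the kernel alone. The sketch below is our own illustration of (9), assuming scikit-learn's SVC, whose dual_coef_ attribute stores \(\alpha _iy _i\) for the support vectors; the function name and the RBF kernel choice are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def pairwise_max_margin(X_pos, X_neg, C=1.0, gamma=1.0):
    # Train a binary SVM on the two classes and return margin = 1 / ||w|| (Eq. 9),
    # with ||w||^2 computed from the dual solution and the RBF kernel.
    X = np.vstack([X_pos, X_neg])
    y = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
    alpha_y = clf.dual_coef_.ravel()                    # alpha_i * y_i of the support vectors
    K = rbf_kernel(clf.support_vectors_, clf.support_vectors_, gamma=gamma)
    w_norm = np.sqrt(alpha_y @ K @ alpha_y)             # ||w|| in the kernel-induced feature space
    return 1.0 / w_norm
```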

3.2 Maximum Margin Matrix

Given a k-class classification problem, we can compute the maximum margin between each pair of classes according to (9). Then all maximum margins can be combined as a maximum margin matrix:

$$\left[ \begin{array}{ccccc} 0 & m_{12} & \cdots & m_{1(k-1)} & m_{1k}\\ m_{21} & 0 & \cdots & m_{2(k-1)} & m_{2k}\\ \vdots & \vdots & \ddots & \vdots & \vdots \\ m_{(k-1)1} & m_{(k-1)2} & \cdots & 0 & m_{(k-1)k}\\ m_{k1} & m_{k2} & \cdots & m_{k(k-1)} & 0 \end{array}\right] $$

where \(m_{ij}\) is the maximum margin between the ith and jth classes. Obviously, this matrix is symmetric, so we only need to compute the upper triangular elements. A larger value of \(m_{ij}\) means a larger maximum margin between the two classes, i.e., the two classes are better separated. Consequently, the maximum margin provides a natural criterion for evaluating the discriminability between classes. In M\(^2\)ECOC, we directly use the maximum margin to build the tree.
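Using the pairwise routine sketched in Sect. 3.1, the matrix itself can be assembled as follows (again a sketch under the same assumptions; only the upper triangle is computed and then mirrored):

```python
import numpy as np

def max_margin_matrix(X, y, C=1.0, gamma=1.0):
    # Build the symmetric k x k matrix of pairwise maximum margins (zeros on the diagonal).
    classes = np.unique(y)
    k = len(classes)
    M = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            m = pairwise_max_margin(X[y == classes[i]], X[y == classes[j]], C, gamma)
            M[i, j] = M[j, i] = m
    return M
```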

3.3 Maximum Margin Tree

Traditional tree algorithms such as DECOC and SECOC usually build the tree from top to bottom, as shown in Fig. 1. In fact, such a strategy emphasizes the discriminability between internal nodes but ignores the discriminability between leaf nodes. For example, in Fig. 1(a), DECOC first separates the original class set \(\{C _1,C _2,C _3\}\) into two partitions \(\{C _1,C _3\}\) and \(\{C _2\}\) and then divides the internal node \(\{C _1,C _3\}\) into two leaf nodes \(\{C _1\}\) and \(\{C _3\}\). As a result, DECOC can guarantee that the partition between the internal node \(\{C _1,C _3\}\) and \(\{C _2\}\) has good discriminability, so the corresponding dichotomizer at that node can achieve satisfactory performance. However, DECOC cannot guarantee that the two leaf nodes \(\{C _1\}\) and \(\{C _3\}\) also have similarly good discriminability, which makes the performance of the corresponding dichotomizer hard to control.

Fig. 2. Construction of a bottom-up maximum margin tree in M\(^2\)ECOC

In M\(^2\)ECOC, we adopt a bottom-up strategy [10] to construct the maximum margin tree. To illustrate the strategy clearly, we take a five-class classification problem as an example in Fig. 2. Concretely, we first regard each class as a subclass and use (9) to compute the maximum margin matrix. According to this matrix, \(\{{C}_1\}\) and \(\{{C}_4\}\) have the maximal maximum margin, so we combine them into a new subclass (Fig. 2(a)) while keeping their original class labels. The new subclass and the remaining classes then form a new four-class classification problem. Repeating the above process, we take \(\{{C}_2\}\) and \(\{{C}_3\}\), which have the maximal maximum margin, as another new subclass (Fig. 2(b)). Consequently, the subclasses \(\{{C}_1,{C}_4\}\), \(\{{C}_2,{C}_3\}\) and \(\{{C}_5\}\) form a new three-class classification problem. Following the same steps, the two subclasses \(\{{C}_1,{C}_4\}\) and \(\{{C}_2,{C}_3\}\) are merged into a new subclass \(\{{C}_1,{C}_2,{C}_3,{C}_4\}\) (Fig. 2(c)). In particular, when computing the margin between the subclasses \(\{{C}_1,{C}_4\}\) and \(\{{C}_2,{C}_3\}\), the classes within the same subclass are temporarily treated as one class. Finally, the expected maximum margin tree is obtained as in Fig. 2(d).
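The following is a minimal sketch of this bottom-up construction: our own greedy implementation of the procedure described above, reusing the pairwise margin routine from Sect. 3.1; classes merged into the same subclass are pooled and treated as one class when margins are recomputed.

```python
import numpy as np

def build_max_margin_tree(X, y, C=1.0, gamma=1.0):
    # Greedy bottom-up merging: at each step fuse the two current subclasses
    # with the largest pairwise maximum margin, recording every merge.
    subclasses = [[c] for c in np.unique(y)]     # each subclass is a list of original labels
    merges = []                                  # (left_labels, right_labels) per internal node
    while len(subclasses) > 1:
        best, pair = -np.inf, None
        for i in range(len(subclasses)):
            for j in range(i + 1, len(subclasses)):
                m = pairwise_max_margin(X[np.isin(y, subclasses[i])],
                                        X[np.isin(y, subclasses[j])], C, gamma)
                if m > best:
                    best, pair = m, (i, j)
        i, j = pair
        left, right = subclasses[i], subclasses[j]
        merges.append((left, right))
        subclasses = [s for t, s in enumerate(subclasses) if t not in (i, j)] + [left + right]
    return merges
```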

Once the hierarchical partition of the maximum margin tree is complete, we obtain the coding matrix M as follows:

$$\begin{aligned} M(r,l)=\left\{ \begin{array}{rcl} 1 &{} &{} {C_r\in P^{left}_l}\\ 0 &{} &{} {C_r\notin P_l}\\ -1 &{} &{} {C_r\in P^{right}_l}\\ \end{array} \right. \end{aligned}$$
(11)

where M(r,l) denotes the element in the rth row and lth column of the coding matrix and \(C _r\) denotes the rth class. \(P ^{left}_l \) and \(P ^{right}_l \) are the left and right partitions of the lth partition respectively (regardless of the root node). Table 2 lists the coding matrix of the above example obtained with (11); a code sketch of this conversion follows the table.

Table 2. Coding matrix of the example in Fig. 2
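Below is our own sketch of this conversion: each recorded merge contributes one column of the coding matrix, filled with +1 for the left partition, −1 for the right partition and 0 elsewhere, following (11); the commented example reproduces the merge order described for Fig. 2.

```python
import numpy as np

def tree_to_coding_matrix(merges, classes):
    # One column per internal node of the maximum margin tree (Eq. 11).
    row = {c: r for r, c in enumerate(classes)}
    M = np.zeros((len(classes), len(merges)), dtype=int)
    for l, (left, right) in enumerate(merges):
        for c in left:
            M[row[c], l] = 1
        for c in right:
            M[row[c], l] = -1
    return M

# Example for the five-class tree of Fig. 2:
# merges = [([1], [4]), ([2], [3]), ([1, 4], [2, 3]), ([1, 2, 3, 4], [5])]
# tree_to_coding_matrix(merges, classes=[1, 2, 3, 4, 5])
```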

4 Experimental Results

In this section, we compare M\(^2\)ECOC with several state-of-the-art coding algorithms, namely OVA [4], OVO [5], dense random [6], sparse random [6], DECOC [8], SECOC [9] and HECOC [10], to validate the effectiveness of our approach.

Ten commonly used multi-class datasets from the UCI repository [15] are used in the experiments: Wine (178, 13, 3), Lenses (24, 4, 3), Glass (214, 9, 6), Balance (625, 4, 3), Cmc (1473, 9, 3), Ecoli (332, 6, 6), Iris (150, 4, 3), Tea (151, 5, 3), Thyroid (215, 5, 3) and Vehicle (846, 18, 4), where the numbers of samples, dimensions and classes are given in brackets. We randomly split each dataset into non-overlapping training and testing sets; the training set contains roughly seventy percent of the samples and the remaining samples form the testing set. The whole process is repeated ten times and the average accuracies are reported.

Moreover, in the dense and sparse random algorithms, the coding matrix is selected from a set of 10000 randomly generated matrices, where P(1) = P(−1) = 0.5 for the dense random matrix and P(1) = P(−1) = 0.25, P(0) = 0.5 for the sparse random matrix [6]. In SECOC, the parameter set \(\varTheta =\{\varTheta _{size},\varTheta _{perf},\varTheta _{impr}\}\) is fixed to \(\varTheta _{size}=\frac{|J|}{50}\), \(\varTheta _{perf}=0\) and \(\varTheta _{impr}=0.95\) according to [9]. The regularization parameter C and the width \(\sigma \) of the radial basis function kernel in HECOC and M\(^2\)ECOC are selected from {\(2^{-6}, 2^{-5},..., 2^5, 2^6\)} by cross-validation.
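For reference, a sparse random coding matrix can be drawn as sketched below (our own illustration with numpy; the selection criterion of keeping the candidate with the largest minimum pairwise Hamming distance between rows is one common choice and is stated here as an assumption rather than the exact procedure of [6]).

```python
import numpy as np

def best_sparse_random_matrix(k, n_cols, n_candidates=10000, seed=0):
    # Draw candidate ternary matrices with P(+1)=P(-1)=0.25, P(0)=0.5 and keep
    # the one whose rows (code words) have the largest minimum Hamming distance.
    rng = np.random.default_rng(seed)
    best, best_score = None, -1
    for _ in range(n_candidates):
        M = rng.choice([1, -1, 0], size=(k, n_cols), p=[0.25, 0.25, 0.5])
        dists = [np.sum(M[i] != M[j]) for i in range(k) for j in range(i + 1, k)]
        if min(dists) > best_score:
            best, best_score = M, min(dists)
    return best
```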

The decoding strategy HD is used to evaluate the performance of the different coding algorithms. Two base classifiers, the Nearest Mean Classifier (NMC) and SVM with a radial basis function kernel, are applied as dichotomizers, where the regularization parameter C is set to 1 [9]. The width \(\sigma \) of the kernel is selected from the same interval as in HECOC and M\(^2\)ECOC.

Table 3. Classification results (mean\(\,\pm \,\)std) of NMC and HD on ten datasets (\(\bullet /\circ \) indicates that our algorithm is significantly better or worse than other algorithms based on the t-test at 95 % significance level)
Table 4. Classification results (mean\(\,\pm \,\)std) of SVM and HD on ten datasets (\(\bullet /\circ \) indicates that our algorithm is significantly better or worse than other algorithms based on the t-test at 95 % significance level)

The classification results on the ten datasets are reported in Tables 3 and 4. From the tables, we can see that M\(^2\)ECOC achieves better or comparable performance compared with the other algorithms on most datasets. In particular, the accuracy of M\(^2\)ECOC exceeds those of the other algorithms by more than 3 % on the Glass and Vehicle sets with NMC in Table 3, and in Table 4 its accuracy exceeds the others by nearly 12 % on the Lenses set with SVM. Furthermore, we also list the average accuracies and standard deviations over all the datasets at the bottom of the tables. It can be clearly seen that M\(^2\)ECOC achieves the best overall performance among the compared algorithms, which further indicates its superiority. In contrast, DECOC and SECOC perform rather poorly with SVM on some datasets; for example, their accuracies are as much as 10 % lower than those of the other algorithms on the Lenses, Balance and Iris sets in Table 4. The reason lies mainly in the fact that they are more sensitive to the choice of base classifier and that they use a non-parametric estimation to compute the MI value, which requires numerous training data to achieve acceptable results.

To further measure the statistical significance of the performance differences, pairwise t-tests [16] at the 95 % significance level are conducted between the algorithms. Specifically, whenever M\(^2\)ECOC achieves significantly better/worse performance than a compared algorithm on a dataset, a win/loss is counted and a marker \(\bullet /\circ \) is shown; otherwise, a tie is counted and no marker is given. The resulting win/tie/loss counts for M\(^2\)ECOC against the compared algorithms are provided in the last line of Tables 3 and 4. As the tables show, M\(^2\)ECOC achieves statistically better or comparable performance on most datasets, which accords with our conclusion.
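For completeness, such a pairwise comparison can be computed as sketched below (our own illustration with scipy.stats; acc_ours and acc_other are assumed to be the ten per-split accuracies of two algorithms on one dataset).

```python
from scipy import stats

def significance_marker(acc_ours, acc_other, alpha=0.05):
    # Paired t-test over the repeated splits; returns a win/loss/tie marker.
    t, p = stats.ttest_rel(acc_ours, acc_other)
    if p >= alpha:
        return "tie"
    return "win" if t > 0 else "loss"
```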

5 Conclusion

In this paper, we present a novel encoding algorithm, M\(^2\)ECOC, for ECOC. Different from existing tree-form encoding algorithms, M\(^2\)ECOC directly utilizes the maximum margin, which is a natural criterion for evaluating the discriminability between classes, to obtain the hierarchical partition of the tree. Specifically, M\(^2\)ECOC regards each class as a subclass and computes the maximum margin matrix. According to this matrix, the two classes with the maximal maximum margin are combined into a new subclass. The new subclass and the remaining classes then form a new multi-class classification problem. These steps are repeated until all classes belong to one subclass. In this way, M\(^2\)ECOC constructs the maximum margin tree in a bottom-up manner, and the corresponding coding matrix can be obtained easily from the tree. The experimental results on several UCI datasets show that M\(^2\)ECOC is superior to several state-of-the-art ECOC encoding algorithms, which validates that the maximum margin is indeed an effective criterion for building the tree in ECOC.