Abstract
Encoding is one of the most important steps in Error Correcting Output Codes (ECOCs). Traditional encoding strategies are usually data-independent. Recently, several tree-form encoding algorithms have been proposed which first utilize mutual information to estimate inter-class separability, create a hierarchical partition of the tree from the top down, and then derive a coding matrix. However, this criterion is usually computed by a non-parametric method, which generally requires a large number of samples and is likely to produce unstable results. In this paper, we present a novel encoding algorithm which takes the maximum margins between classes as its criterion and constructs a binary tree from the bottom up based on the maximum margin. As a result, the corresponding coding matrix is more stable and discriminative for the subsequent classification. Experimental results show that our algorithm performs much better than several state-of-the-art coding algorithms in ECOC.
1 Introduction
The multi-class classification problem has attracted considerable attention in the machine learning field. Traditional solutions tend to transform it into multiple binary problems, with strategies including decision trees, neural networks, and so on.
Error Correcting Output Codes (ECOCs) [1, 2] is a widely used method among these strategies, originally proposed by Dietterich and Bakiri [3]. It usually involves two parts: encoding and decoding. The encoding part generates a sequence of bits, i.e., a code word, for each class; all code words form a coding matrix. The decoding part predicts class labels for unseen data by comparing their output code words with the class code words in the coding matrix under some specific strategy such as Hamming decoding (HD) [4] or Euclidean decoding [5]. In this paper, we focus on the encoding part.
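The decoding step can be made concrete with a small sketch. The snippet below is an illustrative implementation (not from the paper) of ternary Hamming decoding, using one common convention in which a zero entry of the coding matrix contributes a fixed distance of 0.5:

```python
import numpy as np

def hamming_decode(output_word, coding_matrix):
    """Return the row index of the codeword closest in Hamming distance.

    Zero entries in a ternary coding matrix contribute a constant 0.5
    penalty, a common convention for ternary Hamming decoding.
    """
    distances = 0.5 * np.sum(1 - np.sign(output_word) * coding_matrix, axis=1)
    return int(np.argmin(distances))

# Ternary coding matrix for 3 classes (rows) and 3 dichotomizers (columns).
M = np.array([[ 1,  1,  0],
              [-1,  0,  1],
              [ 0, -1, -1]])
print(hamming_decode(np.array([1, 1, -1]), M))  # closest to class 0's codeword
```

Euclidean decoding differs only in the distance function applied between the output word and each row.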
The goal of encoding is to design a coding matrix M. Each row of M represents one class and each column of M defines one binary problem (dichotomizer). For each class, encoding creates a corresponding code word in which each bit is the prediction of the respective dichotomizer. Traditionally, the coding matrix is coded with +1 and −1. In Table 1(a), +1 means that the corresponding dichotomizer takes this class as the positive class, and −1 otherwise. However, the length of the code words in this scenario is fixed; as a result, only a limited number of dichotomizers can be used, which restricts the performance of ECOC to some extent. Allwein et al. [6] further presented a ternary coding matrix which allows some bits of the coding matrix to be zero, as in Table 1(b). The zero symbol denotes that the corresponding class does not participate in the specific classification. Owing to the zero symbols, the ternary coding matrix is more flexible and can have much longer code words than the binary one.
Consequently, the core task in encoding boils down to building an appropriate coding matrix. The simplest strategy is one-versus-all (OVA) [4], which takes one class as the positive class and all the others as the negative class to build a binary coding matrix. One-versus-one (OVO) [5] forms a ternary coding matrix where each column considers only two classes, as the positive and negative classes respectively, and the rest are represented by zero symbols. Random codes [6] generate the coding matrix randomly; the binary variant is called dense random, while the ternary one is termed sparse random. Though these traditional strategies are simple, they are all data-independent. As a result, they either perform poorly or produce overly long code words, which require more dichotomizers at higher computational cost.
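As a sketch of these classical designs (illustrative code, not from the paper), the OVA and OVO coding matrices can be built as follows:

```python
import numpy as np

def ova_matrix(k):
    """One-versus-all: k classes, k dichotomizers; class i is +1 in column i."""
    return 2 * np.eye(k, dtype=int) - 1

def ovo_matrix(k):
    """One-versus-one: k(k-1)/2 columns; only the two involved classes are nonzero."""
    cols = []
    for i in range(k):
        for j in range(i + 1, k):
            col = np.zeros(k, dtype=int)
            col[i], col[j] = 1, -1
            cols.append(col)
    return np.column_stack(cols)

print(ova_matrix(3))
print(ovo_matrix(4))
```

Note that neither design looks at the data: the matrices depend only on the number of classes k.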
Recently proposed tree-form encoding algorithms [7–10] utilize certain criteria to estimate inter-class separability so as to build a tree and obtain a data-dependent coding matrix. Discriminant ECOC (DECOC) [8] applies the sequential forward floating search (SFFS) to generate the tree from the top down by heuristically maximizing the mutual information (MI) [11] between classes. A ternary coding matrix is then constructed according to the hierarchical partition of the tree. Based on DECOC, subclass ECOC (SECOC) [9] further uses a clustering method to create subclasses when the original classification problem is linearly non-separable. However, the MI criterion used in DECOC and SECOC is computed by a non-parametric method, which generally requires a large number of samples and can lead to unstable results. Hierarchical ECOC (HECOC) [10] utilizes support vector domain description (SVDD) [12] as the criterion to estimate inter-class separability, which is more stable than the criterion of DECOC and SECOC. However, when building the tree, HECOC chooses the two classes with the smallest inter-class separability as a node. As a result, the base dichotomizers face relatively difficult binary classification problems, which limits the performance of ECOC to some degree.
In this paper, we propose a novel encoding method termed maximum margin tree ECOC (M\(^2\)ECOC). M\(^2\)ECOC estimates inter-class separability by the maximum margins between classes rather than by the MI criterion. Consequently, the corresponding coding matrix is more stable and discriminative for the subsequent classification. Concretely, M\(^2\)ECOC uses the support vector machine (SVM) [13] to compute the maximum margins between classes and thus obtains a maximum margin matrix. Based on this matrix, M\(^2\)ECOC generates a bottom-up binary tree by repeatedly choosing the maximal maximum margin. Finally, the maximum margin tree is converted into a ternary coding matrix according to its hierarchical partition.
The paper is organized as follows. Section 2 introduces the tree-form encoding algorithms DECOC and SECOC. Section 3 presents our M\(^2\)ECOC algorithm in detail. Section 4 reports comparative experiments with several state-of-the-art encoding algorithms. Finally, the last section concludes the paper.
2 Tree-Form Encoding Algorithms
2.1 Discriminant ECOC (DECOC)
DECOC [8] first applies SFFS to find the hierarchical partition of the tree and builds the tree from the top down. As shown in Fig. 1(a), DECOC separates the original class set \(\{C _1,C _2,C _3\}\) into two partitions \(\{C _1,C _3\}\) and \(\{C _2\}\), and continues until each partition contains only one class. SFFS is a suboptimal sequential search method which dynamically changes the number of forward steps until the resulting subsets are better than the previous ones according to some criterion [8]. MI, an often-used metric in information theory for measuring the dependence between two random variables, is selected in DECOC to evaluate the discriminability of class sets. MI is defined as follows:

\( I(x, y)=\sum _{y}\int _{x} p(x, y)\log \frac{p(x, y)}{p(x)p(y)}\,dx \qquad (1) \)
where x denotes a sample in the class sets and y denotes the class label, and p(x) and p(y) are their probability density functions respectively. DECOC maximizes the MI value between the data in the class sets and the class labels so as to maximize the discriminability of the class sets. However, when computing the MI value, DECOC uses a non-parametric Parzen estimation method, which usually requires a large number of samples to reach relatively good performance and is likely to yield unstable experimental results.
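To illustrate the kind of estimate involved, the sketch below uses a crude joint-histogram estimator as a stand-in for the Parzen method (the actual DECOC implementation uses Parzen windows; everything here is illustrative):

```python
import numpy as np

def mi_histogram(x, y, bins=10):
    """Crude non-parametric estimate of I(x, y) for a 1-D feature x and
    discrete labels y, using a joint histogram in place of Parzen windows."""
    pxy, _, _ = np.histogram2d(x, y, bins=(bins, len(np.unique(y))))
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over the feature bins
    py = pxy.sum(axis=0, keepdims=True)   # marginal over the labels
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 500)
x = y + rng.normal(0.0, 0.1, size=1000)  # feature strongly tied to the label
print(mi_histogram(x, y))                # close to log(2) nats for two balanced classes
```

With few samples per bin, such plug-in estimates fluctuate noticeably from split to split, which is exactly the instability the text describes.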
Once the tree has been completely constructed, DECOC fills a ternary coding matrix based on the tree. In particular, the classes in the left partition are represented by +1 and the classes in the right partition by −1, while classes that do not appear in the hierarchical partition are represented by 0.
2.2 Subclass ECOC (SECOC)
On the basis of DECOC, SECOC [9] addresses linearly non-separable problems by dividing the original class into new subclasses. SECOC also uses SFFS to find the hierarchical partition of the tree by maximizing the MI value. When the original partition is linearly non-separable, SECOC further uses the clustering method K-means to split it into simpler and smaller sub-partitions; the number of sub-partitions is usually set to 2 [9]. As shown in Fig. 1(b), SECOC splits the original linearly non-separable problem \(\{C_1, C _3\}\) into two linearly separable problems \(\{C_1, C_{3\_1}\}\) and \(\{C_1, C _{3\_2}\}\) by dividing class \(C_3\) into two new subclasses \(C_{3\_1}\) and \(C_{3\_2}\). Therefore, if the original classification problem is linearly non-separable, SECOC can transform it into a linearly separable one through several such decompositions.
3 Maximum Margin Tree ECOC (M\(^2\)ECOC)
3.1 Maximum Margin
The margin, defined as the minimum distance between the decision boundary and the samples, is one of the central concepts in the SVM proposed by Vapnik [13].
Specifically, the decision boundary, also called the decision hyperplane, is denoted as follows:

\( f({{\varvec{x}}})={{\varvec{w}}}^{T}\varPhi ({{\varvec{x}}})+b=0 \qquad (2) \)
where \(\varPhi ({{\varvec{x}}})\) is a fixed feature-space transformation function, w is a weight vector and b is the bias. The functional margin can be formulated as:

\( \hat{r}=\min _{i=1,\ldots ,N}\ y_i({{\varvec{w}}}^{T}\varPhi ({{\varvec{x}}}_i)+b) \qquad (3) \)
where \(y _i\in {{\varvec{y}}}\) denotes the corresponding class label and N is the number of samples. However, the functional margin is not invariant to a rescaling of w and b. So we further obtain the geometric margin by normalizing (3):

\( r=\min _{i=1,\ldots ,N}\ \frac{y_i({{\varvec{w}}}^{T}\varPhi ({{\varvec{x}}}_i)+b)}{||{{\varvec{w}}}||} \qquad (4) \)
According to (3) and (4), we can easily obtain the relationship between the functional margin and the geometric margin as follows:

\( r=\frac{\hat{r}}{||{{\varvec{w}}}||} \qquad (5) \)
Let \(\hat{r}\) equal 1. The maximum margin can then be found by solving the following problem:

\( \max _{{{\varvec{w}}},b}\ ||{{\varvec{w}}}||^{-1} \quad \text {s.t.}\ y_i({{\varvec{w}}}^{T}\varPhi ({{\varvec{x}}}_i)+b)\ge 1,\ i=1,\ldots ,N \qquad (6) \)
It is obvious that the maximization of \(||{{\varvec{w}}}||^{-1}\) is equivalent to the minimization of \({||{{\varvec{w}}}||}\). So we can transform (6) into the following soft-margin optimization problem:

\( \min _{{{\varvec{w}}},b,\varvec{\xi }}\ \frac{1}{2}||{{\varvec{w}}}||^{2}+C\sum _{i=1}^{N}\xi _i \quad \text {s.t.}\ y_i({{\varvec{w}}}^{T}\varPhi ({{\varvec{x}}}_i)+b)\ge 1-\xi _i,\ \xi _i\ge 0 \qquad (7) \)
where the parameter \(\varvec{\xi }\) is the slack variable and C balances \(\varvec{\xi }\) against the margin. Problem (7) can further be turned into a dual problem using Lagrange multipliers with kernel functions:

\( \max _{\varvec{\alpha }}\ \sum _{i=1}^{N}\alpha _i-\frac{1}{2}\sum _{i=1}^{N}\sum _{j=1}^{N}\alpha _i\alpha _j y_i y_j K({{\varvec{x}}}_i,{{\varvec{x}}}_j) \quad \text {s.t.}\ \sum _{i=1}^{N}\alpha _i y_i=0,\ 0\le \alpha _i\le C \qquad (8) \)
By solving (8) we obtain \(\varvec{\alpha }\). Consequently, the maximum margin can finally be computed as follows [14]:

\( margin=\frac{2}{||{{\varvec{w}}}||}=\frac{2}{\sqrt{\sum _{i=1}^{N}\sum _{j=1}^{N}\alpha _i\alpha _j y_i y_j K({{\varvec{x}}}_i,{{\varvec{x}}}_j)}} \qquad (9) \)
where the vector w is determined by \({{\varvec{w}}}=\sum _{i=1}^{N}\alpha _i y_i \varPhi ({{\varvec{x}}}_i)\).
3.2 Maximum Margin Matrix
Given a k-class classification problem, we can compute the maximum margin between each pair of classes according to (9). All maximum margins can then be combined into a maximum margin matrix:

\( {{\varvec{M}}}_{m}=\begin{pmatrix} 0 & m_{12} & \cdots & m_{1k} \\ m_{21} & 0 & \cdots & m_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ m_{k1} & m_{k2} & \cdots & 0 \end{pmatrix} \qquad (10) \)
where \(m_{ij}\) is the maximum margin between the ith and jth classes. This matrix is obviously symmetric, so we only compute the upper triangular elements. The larger the value of \(m_{ij}\), the larger the maximum margin between the two classes, and a larger maximum margin means that the two classes are better separated. Consequently, the maximum margin provides a natural criterion for evaluating the discriminability between classes. In M\(^2\)ECOC, we directly use the maximum margin to build the tree.
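A sketch of how such a matrix could be computed with an off-the-shelf SVM follows. It uses scikit-learn's `SVC`, which the paper does not prescribe, and a linear kernel for simplicity; for a kernel machine the norm \(||{{\varvec{w}}}||\) would instead be read off the dual coefficients:

```python
import numpy as np
from sklearn.svm import SVC

def pairwise_margin(Xi, Xj, C=1.0):
    """Maximum margin 2/||w|| between two classes with a linear SVM."""
    X = np.vstack([Xi, Xj])
    y = np.hstack([np.ones(len(Xi)), -np.ones(len(Xj))])
    svm = SVC(kernel="linear", C=C).fit(X, y)
    return 2.0 / np.linalg.norm(svm.coef_.ravel())

def margin_matrix(class_data, C=1.0):
    """Symmetric k x k matrix of pairwise maximum margins."""
    k = len(class_data)
    M = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            M[i, j] = M[j, i] = pairwise_margin(class_data[i], class_data[j], C)
    return M

A = np.array([[0., 0.], [0., 1.]])        # toy 2-D classes
B_near = np.array([[1., 0.], [1., 1.]])   # distance 1 from A
B_far = np.array([[4., 0.], [4., 1.]])    # distance 4 from A
m_near, m_far = pairwise_margin(A, B_near), pairwise_margin(A, B_far)
Mm = margin_matrix([A, B_near, B_far])
print(m_near, m_far)   # the farther class pair has the larger margin
```

Well-separated class pairs receive larger entries, which is exactly what the bottom-up tree construction below exploits.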
3.3 Maximum Margin Tree
Traditional tree algorithms such as DECOC and SECOC usually build the tree from the top down, as shown in Fig. 1. Such a strategy emphasizes the discriminability between internal nodes but ignores the discriminability between leaf nodes. For example, in Fig. 1(a), DECOC first separates the original class set \(\{C _1,C _2,C _3\}\) into two partitions \(\{C _1,C _3\}\) and \(\{C _2\}\), and then divides the internal node \(\{C _1,C _3\}\) into two leaf nodes \(\{C _1\}\) and \(\{C _3\}\). As a result, DECOC can guarantee that the partition between the internal node \(\{C _1,C _3\}\) and \(\{C _2\}\) has good discriminability, so the corresponding dichotomizer at the internal node can achieve satisfactory performance. However, DECOC cannot guarantee that the two leaf nodes \(\{C _1\}\) and \(\{C _3\}\) have similarly good discriminability, which leaves the performance of the corresponding dichotomizer uncontrolled.
In M\(^2\)ECOC, we adopt a bottom-up strategy [10] to construct the maximum margin tree. To illustrate the strategy, we take a five-class classification problem as an example in Fig. 2. Concretely, we first regard each class as a subclass and use (9) to compute the maximum margin matrix. According to this matrix, \(\{{C}_1\}\) and \(\{{C}_4\}\) have the maximal maximum margin, so we combine them into a new subclass (Fig. 2(a)) while keeping their original class labels. The new subclass and the remaining classes then constitute a new four-class classification problem. Repeating the above process, we take \(\{{C}_2\}\) and \(\{{C}_3\}\), which have the maximal maximum margin, as another new subclass (Fig. 2(b)). Consequently, the subclasses \(\{{C}_1,{C}_4\}\), \(\{{C}_2,{C}_3\}\) and \(\{{C}_5\}\) form a new three-class classification problem. Following the same steps, the two subclasses \(\{{C}_1,{C}_4\}\) and \(\{{C}_2,{C}_3\}\) are merged into a new subclass \(\{{C}_1,{C}_2,{C}_3,{C}_4\}\) (Fig. 2(c)). In particular, when computing the margin between the subclasses \(\{{C}_1,{C}_4\}\) and \(\{{C}_2,{C}_3\}\), the classes in the same subclass are temporarily considered as one class. Finally, the expected maximum margin tree is obtained as in Fig. 2(d).
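The merging procedure above can be sketched as follows. This is illustrative code, not the paper's implementation: `margin_fn` abstracts the SVM margin computation between two (possibly merged) subclasses, and the toy margin function used in the example scores a merged subclass by the smallest base margin across its members, a stand-in for retraining an SVM on the merged data:

```python
import numpy as np

def build_margin_tree(margin_fn, classes):
    """Greedy bottom-up construction of the maximum margin tree.

    Repeatedly merge the two current subclasses with the largest pairwise
    maximum margin; the recorded merge order is the hierarchical partition.
    margin_fn(a, b) returns the margin between two tuples of class labels.
    """
    subclasses = [(c,) for c in classes]
    partitions = []
    while len(subclasses) > 1:
        best, pair = -np.inf, None
        for i in range(len(subclasses)):
            for j in range(i + 1, len(subclasses)):
                m = margin_fn(subclasses[i], subclasses[j])
                if m > best:
                    best, pair = m, (i, j)
        left, right = subclasses[pair[0]], subclasses[pair[1]]
        partitions.append((left, right))
        subclasses = [s for t, s in enumerate(subclasses) if t not in pair]
        subclasses.append(left + right)   # merged subclass keeps its labels
    return partitions

# Toy pairwise margins reproducing the five-class example of Fig. 2.
base = {frozenset(p): m for p, m in [((1, 4), 10), ((2, 3), 9),
                                     ((1, 2), 5), ((1, 3), 5),
                                     ((2, 4), 5), ((3, 4), 5),
                                     ((1, 5), 1), ((2, 5), 1),
                                     ((3, 5), 1), ((4, 5), 1)]}

def toy_margin(a, b):
    return min(base[frozenset((x, y))] for x in a for y in b)

parts = build_margin_tree(toy_margin, [1, 2, 3, 4, 5])
print(parts)  # first merges {C1, C4}, then {C2, C3}, as in Fig. 2
```

Each recorded pair `(left, right)` corresponds to one internal node of the tree, and hence to one dichotomizer.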
After the optimal hierarchical partition of the maximum margin tree has been obtained, we derive the coding matrix M as follows:

\( M(r,l)={\left\{ \begin{array}{ll} +1, & C_r\in P^{left}_l \\ -1, & C_r\in P^{right}_l \\ 0, & \text {otherwise} \end{array}\right. } \qquad (11) \)
where M(r, l) denotes the element in the rth row and lth column of the coding matrix, \(C _r\) denotes the rth class, and \(P ^{left}_l \) and \(P ^{right}_l \) are the left and right parts of the lth partition respectively (the root node is disregarded). Table 2 lists the coding matrix of the above example obtained by following (11).
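Rule (11) translates into a few lines. The sketch below is illustrative and hard-codes the partitions of the five-class example, with the merged pair of each step placed on the left:

```python
import numpy as np

def tree_to_coding_matrix(partitions, classes):
    """Fill the ternary coding matrix: +1 for classes in the left part of a
    partition, -1 for the right part, 0 for classes that do not appear."""
    row = {c: r for r, c in enumerate(classes)}
    M = np.zeros((len(classes), len(partitions)), dtype=int)
    for l, (left, right) in enumerate(partitions):
        for c in left:
            M[row[c], l] = 1
        for c in right:
            M[row[c], l] = -1
    return M

# Partitions of the maximum margin tree from the five-class example (Fig. 2).
parts = [((1,), (4,)), ((2,), (3,)), ((1, 4), (2, 3)), ((1, 2, 3, 4), (5,))]
M = tree_to_coding_matrix(parts, [1, 2, 3, 4, 5])
print(M)
```

Each column is one dichotomizer; classes absent from a partition receive 0, exactly as in the ternary matrices of Sect. 1.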
4 Experimental Results
In this section, we compare M\(^2\)ECOC with several state-of-the-art coding algorithms, namely OVA [4], OVO [5], dense random [6], sparse random [6], DECOC [8] (see Note 1), SECOC [9] and HECOC [10], to validate the superiority of our approach.
Ten multi-class datasets from the commonly used UCI repository [15] are used in the experiments, namely Wine (178, 13, 3), Lenses (24, 4, 3), Glass (214, 9, 6), Balance (625, 4, 3), Cmc (1473, 9, 3), Ecoli (332, 6, 6), Iris (150, 4, 3), Tae (151, 5, 3), Thyroid (215, 5, 3) and Vehicle (846, 18, 4), where the numbers of samples, dimensions and classes are listed in the brackets. We randomly split each dataset into two non-overlapping training and testing sets, where the training set contains about seventy percent of the samples and the remaining samples form the testing set. The whole process is repeated ten times and the average accuracies are reported.
Moreover, for the dense and sparse random algorithms, all random matrices are selected from a set of 10000 randomly generated matrices, with P(1) = P(−1) = 0.5 for the dense random matrix and P(1) = P(−1) = 0.25, P(0) = 0.5 for the sparse random matrix [6]. In SECOC, the parameter set \(\varTheta =\{\varTheta _{size},\varTheta _{perf},\varTheta _{impr}\}\) is fixed to \(\varTheta _{size}=\frac{|J|}{50}\), \(\varTheta _{perf}=0\) and \(\varTheta _{impr}=0.95\) according to [9]. The regularization parameter C and the width \(\sigma \) of the radial basis function kernel in HECOC and M\(^2\)ECOC are selected from the set {\(2^{-6}, 2^{-5},..., 2^5, 2^6\)} by cross-validation.
The decoding strategy HD is used to evaluate the performance of the different coding algorithms. Two base classifiers, the Nearest Mean Classifier (NMC) and SVM with radial basis function kernel, are applied as the dichotomizers, where the regularization parameter C is set to 1 [9]. The width \(\sigma \) of the kernel is again selected from the same set in HECOC and M\(^2\)ECOC.
The classification results on the ten datasets are reported in Tables 3 and 4. From the tables, we can see that M\(^2\)ECOC reaches better or comparable performance relative to the compared algorithms on most datasets. In particular, the accuracies of M\(^2\)ECOC exceed those of the other algorithms by more than 3 % on the Glass and Vehicle sets with NMC in Table 3, and in Table 4 its accuracy exceeds the others by nearly 12 % on the Lenses set with SVM. Furthermore, the average accuracies and standard deviations over all the datasets are listed at the bottom of the tables. M\(^2\)ECOC clearly possesses the best overall performance among the compared algorithms, which further indicates its superiority. In contrast, DECOC and SECOC perform much more poorly with SVM on some datasets; for example, their accuracies are as much as 10 % lower than those of the other algorithms on the Lenses, Balance and Iris sets in Table 4. The reason is mainly that they are more sensitive to the choice of base classifier and that they use a non-parametric estimation to compute the MI value, which requires numerous training samples to achieve acceptable results.
To further measure the statistical significance of the performance differences, pairwise t-tests [16] at the 95 % significance level are conducted between the algorithms. Specifically, whenever M\(^2\)ECOC achieves significantly better/worse performance than a compared algorithm, a win/loss is counted and a marker \(\bullet /\circ \) is shown; otherwise, a tie is counted and no marker is given. The resulting win/tie/loss counts for M\(^2\)ECOC against the compared algorithms are provided in the last line of Tables 3 and 4. As the tables show, M\(^2\)ECOC achieves statistically better or comparable performance on most datasets, which accords with our conclusion.
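The significance test can be reproduced with SciPy; the accuracies below are made-up illustrative numbers, not the paper's results:

```python
import numpy as np
from scipy import stats

# Accuracies of two algorithms over the ten random splits (illustrative).
acc_a = np.array([0.91, 0.93, 0.90, 0.94, 0.92, 0.95, 0.91, 0.93, 0.92, 0.94])
acc_b = acc_a - np.array([0.02, 0.04, 0.03, 0.02, 0.05,
                          0.03, 0.02, 0.04, 0.03, 0.02])

# Paired t-test at the 95% significance level: a win is counted when the
# difference is significant and in favour of the first algorithm.
t, p = stats.ttest_rel(acc_a, acc_b)
win = (p < 0.05) and (t > 0)
print(win)
```

A paired (rather than independent) test is appropriate here because both algorithms are evaluated on the same ten random splits.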
5 Conclusion
In this paper, we have presented a novel encoding algorithm, M\(^2\)ECOC, for ECOC. Different from existing tree-form encoding algorithms, M\(^2\)ECOC directly utilizes the maximum margin, a natural criterion for evaluating the discriminability between classes, to obtain the optimal hierarchical partition of the tree. Specifically, M\(^2\)ECOC regards each class as a subclass and computes the maximum margin matrix. According to this matrix, the classes with the maximal maximum margin are combined into a new subclass, and the new subclass together with the remaining classes forms a new multi-class classification problem. These steps repeat until all classes are merged into one subclass. In this way, M\(^2\)ECOC constructs the maximum margin tree in a bottom-up manner, and the corresponding coding matrix is easily obtained from the tree. The experimental results on several UCI datasets show that M\(^2\)ECOC is superior to several state-of-the-art ECOC encoding algorithms, which further validates that the maximum margin is an effective criterion for building the tree in ECOC.
Notes
- 1.
We downloaded the DECOC code from http://jmlr.csail.mit.edu/papers/v11/escalera10a.html, provided by Sergio Escalera, Oriol Pujol and Petia Radeva in 2010.
References
Japkowicz, N., Barnabe-Lortie, V., Horvatic, S., et al.: Multi-class learning using data driven ECOC with deep search and re-balancing. In: IEEE International Conference on DSAA, pp. 1–10 (2015)
Liu, M., Zhang, D., Chen, S., et al.: Joint binary classifier learning for ECOC-based multi-class classification. IEEE Trans. Pattern Anal. Mach. Intell. 1, 0162–8828 (2015)
Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1995)
Nilsson, N.J.: Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill, New York (1965)
Hastie, T., Tibshirani, R.: Classification by pairwise coupling. Ann. Stat. 26(2), 451–471 (1998)
Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2001)
Escalera, S., Pujol, O., Radeva, P.: Boosted landmarks of contextual descriptors and forest-ECOC: a novel framework to detect and classify objects in cluttered scenes. Pattern Recogn. Lett. 28(13), 1759–1768 (2007)
Pujol, O., Radeva, P., Vitria, J.: Discriminant ECOC: a heuristic method for application dependent design of error correcting output codes. IEEE Trans. Pattern Anal. Mach. Intell. 28(6), 1001–1007 (2006)
Escalera, S., Tax, D.M.J., Pujol, O., et al.: Subclass problem-dependent design for error-correcting output codes. IEEE. Trans. Pattern Anal. Mach. Intell. 30(6), 1041–1054 (2008)
Lei, L., Wang, X., Luo, X., et al.: Hierarchical error-correcting output codes based on SVDD. J. Syst. Eng. Electron. 37(8), 1916–1921 (2015). (In Chinese)
Principe, J.C., Xu, D., Fisher, J.: Information theoretic learning. In: Unsupervised Adaptive Filtering, vol. 1, pp. 265–319 (2000)
Tax, D.M.J., Duin, R.P.W.: Support vector domain description. Pattern Recogn. Lett. 20(11), 1191–1199 (1999)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Lu, M., Huo, J., Chen, C.L.P., et al.: Multi-stage decision tree based on inter-class and inner-class margin of SVM. In: IEEE International Conference on SYST, pp. 1875–1880 (2009)
Asuncion, A., Newman, D.: UCI Machine Learning Repository (2007)
Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, New York (2004)
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61375057, 61300165 and 61403193) and the Natural Science Foundation of Jiangsu Province of China (Grant No. BK20131298). The work was also supported by the Collaborative Innovation Center of Wireless Communications Technology.
© 2016 Springer International Publishing Switzerland
Zheng, F., Xue, H., Chen, X., Wang, Y. (2016). Maximum Margin Tree Error Correcting Output Codes. In: Booth, R., Zhang, ML. (eds) PRICAI 2016: Trends in Artificial Intelligence. PRICAI 2016. Lecture Notes in Computer Science(), vol 9810. Springer, Cham. https://doi.org/10.1007/978-3-319-42911-3_57