1 Introduction

Many methods of statistical classification can only discriminate between two classes. Examples include linear classifiers such as perceptrons and logistic regression [1], piecewise linear classifiers [2, 3], as well as support vector machines [4]. There are many ways of generalizing binary classification to multi-class, and the number of possibilities increases exponentially with the number of classes.

One should distinguish between multi-class methods that use only a subset of the binary classifiers, adding more as the algorithm narrows down the class, and those that use all of the binary classifiers, combining the results or solving for the class probabilities. In the former category, we have hierarchical multi-class classifiers such as decision trees [5, 6] and decision directed acyclic graphs (DDAGs) [7]. In the latter category, two common methods are one-versus-one (1 vs. 1) and one-versus-the-rest (1 vs. rest) [8]. These in turn generalize to error-correcting codes (ECCs) [9].

Early experiments with ECCs used random codes: the assumption is that if the codes are long enough (there are enough binary classifiers) they will adequately span the classes. Later work focused on optimizing the design of the codes: what type of codes will best span the classes and produce the most accurate results? Here we can also distinguish between two types: those that use the data to help design the codes [10,11,12] and those that are independent of the data but use the mathematical properties of the codes themselves to aid in their construction [13,14,15]. It is this latter type of optimized error-correcting code that we turn to in this note.

In error-correcting coding, there is a coding matrix, A, that specifies how the set of multiple classes is partitioned for each binary classifier. For a given column, if members of the jth class are to be labeled \(-1\) or \(+1\) by the binary classifier, then the jth row of that column is assigned a \(-1\) or \(+1\), respectively. If the jth class is left out, then the jth row is assigned a 0. Typically, the class of the test point is determined by the distance between a row in the matrix and a vector of binary decision functions:

$$\begin{aligned} c(\mathbf {x}) = \arg \min _i | \mathbf {a}_i - \mathbf {r}(\mathbf {x}) | \end{aligned}$$
(1)

where \(\mathbf {a}_i\in \lbrace -1,0,+1 \rbrace\) is the ith row of the coding matrix and \(\mathbf {r}\) is the vector of decision functions at the test point, \(\mathbf {x}\). If we take the upright brackets to denote Euclidean distance, we can expand (1) as follows:

$$\begin{aligned} c = \arg \min _i \left( | \mathbf {a}_i |^2 + | \mathbf {r} |^2 - 2\, \mathbf {a}_i \cdot \mathbf {r} \right) \end{aligned}$$

Since \(| \mathbf {r} |^2\) is constant over i, it may be removed from the expression. Also, for the purposes of this note, each row of A will be given the same number of non-zero entries, hence:

$$\begin{aligned} | \mathbf {a}_i | = | \mathbf {a}_j | = const. \end{aligned}$$

This is most evident in the case in which each binary classifier partitions all of the classes, so that there are no zeros in A, as with one-versus-the-rest partitioning. Then (1) reduces to a voting solution:

$$\begin{aligned} c = \arg \max _i \left( A \mathbf {r} \right) _i \end{aligned}$$
(2)

Both [13] and [14] show that to maximize the accuracy of an ECC, the distance between each pair of rows, \(|\mathbf {a}_i - \mathbf {a}_j|_{i \ne j}\), should be maximized. Using the above assumptions, this reduces to:

$$\begin{aligned} \min |\mathbf {a}_i \cdot \mathbf {a}_j|_{i \ne j} \end{aligned}$$

Note the absolute value prevents degenerate rows. In other words, the coding matrix, A, should be orthogonal.
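To make the decoding concrete, the following minimal sketch (in Python with numpy; the function names and toy values are our own illustration, not taken from any reference implementation) computes the class both by the distance rule in (1) and by the voting rule in (2):

```python
import numpy as np

def classify_distance(A, r):
    """Return the class whose code row is nearest (in Euclidean distance)
    to the vector of binary decision functions r -- Eq. (1)."""
    return int(np.argmin(np.linalg.norm(A - r, axis=1)))

def classify_voting(A, r):
    """Voting form, Eq. (2): valid when every row of A has the same
    number of non-zero entries (for instance, no zeros at all)."""
    return int(np.argmax(A @ r))

# toy example: one-versus-the-rest coding for three classes
A = np.array([[ 1, -1, -1],
              [-1,  1, -1],
              [-1, -1,  1]])
r = np.array([-0.2, 0.7, -0.5])          # hypothetical decision functions
print(classify_distance(A, r), classify_voting(A, r))   # both give class 1
```

With constant row norms the two rules agree, which is the point of the reduction above.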

In this note, we describe a fast and simple algorithm that uses orthogonal ECCs to solve for the conditional probabilities in multi-class classification. There are three reasons to require the conditional probabilities:

  1. Probabilities provide useful extra information, specifically how accurate a given classification is in the absence of knowledge of its true value.

  2. The relationship between the binary probabilities and the multi-class probabilities derives uniquely and rigorously from probability theory.

  3. Binary classifiers that do not return calibrated probability estimates, but nonetheless supply a continuous decision function, are easy to recalibrate so that the decision function more closely resembles a probability [16, 17].

Two types of orthogonal ECCs, along with three other multi-class methods (1 versus 1, 1 versus the rest, and random ECCs), will be tested on seven different datasets using three different binary classifiers (logistic regression, support vector machines (SVM), and piecewise linear) to see how they compare in terms of classification speed, classification accuracy, and accuracy of the conditional probabilities.

2 Algorithm

We wish to design a set of n binary classifiers, each of which returns a decision function:

$$\begin{aligned} r_j(\mathbf {x}) = P_j(+1 | \mathbf {x}) - P_j(-1 | \mathbf {x}) \end{aligned}$$

where \(P_j(c | \mathbf {x})\) is the conditional probability of the cth class of the jth classifier. Each binary classifier partitions a set of m classes such that for a given test point, \(\mathbf {x}\):

$$\begin{aligned} \sum _{i=1}^m a_{ij} p_i = r_j;\quad j=[1,\ldots ,n] \end{aligned}$$

where \(A=\lbrace a_{ij} \in \lbrace -1, +1 \rbrace \rbrace\) is a coding matrix for which each code partitions all of the classes and \(p_i = p(i | \mathbf {x})\) is the conditional probability of the ith class. In vector notation:

$$\begin{aligned} A^T \mathbf {p} = \mathbf {r} \end{aligned}$$
(3)

This result derives from the fact that the class probabilities are additive [18]. The more general case, in which a class can be excluded, that is, the coding may include zeros, \(a_{ij} \in \lbrace -1, 0, +1\rbrace\), will be treated in the next section.
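As a concrete check, consider one-versus-the-rest coding with \(m=3\) and the column \(\mathbf {a}_{\cdot j} = (+1, -1, -1)^T\). Since the class probabilities are additive,

$$\begin{aligned} \sum _{i=1}^3 a_{ij} p_i = p_1 - p_2 - p_3 = P_j(+1 | \mathbf {x}) - P_j(-1 | \mathbf {x}) = r_j \end{aligned}$$

which is the jth component of (3).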

Note that this assumes that the binary decision functions, \(\mathbf {r}\), estimate the conditional probabilities perfectly. In practice there is a set of constraints that must be enforced because \(\mathbf {p}\) is only allowed to take on certain values. Thus, we wish to solve the following minimization problem:

$$\begin{aligned}&\arg \min _{\mathbf {p}} | A^T \mathbf {p} - \mathbf {r} | \end{aligned}$$
(4)
$$\begin{aligned}&\sum _{i=1}^m p_i = 1 \end{aligned}$$
(5)
$$\begin{aligned}&p_i \ge 0; \quad i=[1,\ldots ,m] \end{aligned}$$
(6)

If A is orthogonal,

$$\begin{aligned} A A^T = n I \end{aligned}$$

where I is the \(m \times m\) identity matrix, then the unconstrained minimization problem is easy to solve. Note that the voting solution in (2) is now equivalent to the inverse solution in (3). This allows us to determine the class easily, but we also wish to solve for the probabilities, \(\mathbf {p}\), so that none of the constraints in (5) or (6) are violated.

The orthogonality property allows us to reduce the minimization problem in (4) to something much simpler:

$$\begin{aligned} \arg \min _{\mathbf {p}} | \mathbf {p} - \mathbf {p}_0 | \end{aligned}$$

where \(\mathbf {p}_0 = A \mathbf {r}/n\), with the constraints in (5) and (6) remaining the same. Because the system has been rotated and expanded, the non-negativity constraints in (6) remain orthogonal, meaning they are independent: enforcing one by setting one of the probabilities to zero, \(p_k=0\) for example, should not otherwise affect the solution. This still leaves the normalization constraint in (5): the problem, now strictly geometrical, consists of finding the point nearest \(\mathbf {p}_0\) on the diagonal hyper-surface that bisects the unit hyper-cube.

Briefly, we can summarize the algorithm as follows: (1) move to the nearest point that satisfies the normalization constraint, (5); (2) if one or more of the probabilities is negative, move to the nearest point that satisfies both the normalization constraint and the non-negativity constraints, (6), for the negative probabilities; (3) repeat step 2 until no probabilities are negative. More formally, let \(\mathbf {1}\) be a vector of all 1’s:

  • \(i:=0\); \(m_0:=m\)

  • while \(\exists k \, p_{ik} < 0 \vee \mathbf {p}_i \cdot \mathbf {1} \ne 1\):

    • if \(\mathbf {p}_i \cdot \mathbf {1} \ne 1\) then \(\mathbf {p}_{i+1} := \mathbf {p}_i - \left[ (\mathbf {p}_i \cdot \mathbf {1} - 1)/m_i \right] \mathbf {1}\); otherwise \(\mathbf {p}_{i+1} := \mathbf {p}_i\)

    • let K be the set of k such that \(p_{i+1,k} < 0\)

    • for each \(k \in K\):

      • \(p_{i+1,k}:=0\)

      • Remove k from the problem

    • \(m_{i+1}:=m_i-|K|\)

    • \(i:=i+1\)

Note that the resultant direction vectors for each step form an orthogonal set. For instance, suppose \(m_0=4\) and, after enforcing the normalization constraint, the first probability is less than zero, \(p_{1,1} < 0\); then the unit direction vectors for the two motions satisfy:

$$\begin{aligned} \frac{1}{2}[1, 1, 1, 1] \cdot \frac{1}{2\sqrt{3}} [-3, 1, 1, 1] = 0 \end{aligned}$$

More generally, consider the following sequence of vectors:

$$\begin{aligned} v_{ij} = \frac{1}{\sqrt{(m-i)(m-i+1)}} \left\{ \begin{array}{ll} 0; &{}\quad j < i \\ -(m-i); &{}\quad j=i \\ 1; &{}\quad j > i \end{array} \right. \end{aligned}$$

where \(i \in [1, m-1]\) and \(j \in [1, m]\) [19]. A nice feature of this method, in addition to being fast, is that it is divided into two stages: a solution stage and a normalization stage.
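A minimal sketch of this procedure in Python with numpy (our own rendering, not the libAGF implementation) starts from the unconstrained solution \(\mathbf {p}_0 = A \mathbf {r}/n\) and alternately re-normalizes and zeros out negative probabilities:

```python
import numpy as np

def solve_probabilities(A, r):
    """Estimate multi-class probabilities from binary decision functions r,
    given an orthogonal coding matrix A with no zeros, i.e. A @ A.T == n * I."""
    m, n = A.shape
    p = A @ r / n                       # unconstrained solution, p0 = A r / n
    active = np.ones(m, dtype=bool)     # classes still in the problem
    while True:
        k = active.sum()
        p[active] -= (p[active].sum() - 1.0) / k   # nearest point with sum(p) = 1
        neg = active & (p < 0)
        if not neg.any():
            break
        p[neg] = 0.0                    # enforce non-negativity ...
        active &= ~neg                  # ... and remove those classes
    return p
```

The class estimate is then simply the index of the largest returned probability, consistent with the voting solution in (2).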

3 Constructing the coding matrix

Finding an A such that \(A A^T = n I\) and \(a_{ij} \in \lbrace -1, +1 \rbrace\) is quite a difficult combinatorial problem. When zeros are allowed, \(a_{ij}\in \lbrace -1, 0, +1\rbrace\), it becomes even more difficult. Work in signal processing may be of limited applicability because coding matrices there are typically composed of 0’s and 1’s rather than \(-1\)’s and \(+1\)’s [20, 21]. In our case, a further restriction is that each column must contain both positive and negative elements, or:

$$\begin{aligned} \left| \sum _{i=1}^m a_{ij} \right| \ne \sum _{i=1}^m |a_{ij}|;\quad j=[1\ldots n] \end{aligned}$$
(7)

A simple method of designing an orthogonal A is to use harmonic series. Consider the following matrix for six classes (\(m=6\)) and eight binary classifiers (\(n=8\)):

$$\begin{aligned} A = \left[ \begin{array}{rrrrrrrr} 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad -1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad -1 \end{array} \right] \end{aligned}$$
(8)

This will limit the size of m relative to n; more precisely: \(m \le \lfloor 2 \log _2 n \rfloor\). Moreover, only certain values of n will be admitted: \(n=2^t\) where t is a whole number.

Rows two through four in (8) comprise a Walsh-Hadamard code [22]: reading down their columns, all possible sign combinations are listed. A square (\(n=m\)) orthogonal coding matrix is called a Hadamard matrix [23]. It can be shown that, besides \(n=1\) and \(n=2\), Hadamard matrices can exist only for sizes \(n=4t\), and it remains unproven whether examples exist for every value of t [24]. A very simple, recursive method exists to generate matrices of size \(n=2^t\) [24], but these cannot be made to have the property in (7) since the matrix includes both a row and a column of only ones. Such a matrix will include a “harmonic series” of the same type as in (8).
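One way to realize a Walsh-style construction in code is sketched below (in Python with numpy). This is our own interpretation rather than the method actually used to produce (8): it draws rows from the full set of Walsh functions (ignoring the \(m \le \lfloor 2 \log _2 n \rfloor\) restriction of the particular harmonic-series construction above) and flips row signs at random until condition (7) holds.

```python
import numpy as np

def walsh_row(s, n):
    """Walsh function for bit mask s at positions j = 0..n-1: the entry is
    (-1)**popcount(j & s).  Distinct masks give mutually orthogonal rows."""
    return np.array([(-1) ** bin(j & s).count("1") for j in range(n)])

def walsh_coding_matrix(m, t, tries=1000, rng=None):
    """Sketch: an m x n orthogonal coding matrix (n = 2**t, m <= n) built from
    Walsh rows, with row signs flipped so that every column contains both a
    +1 and a -1 (condition (7)).  Row sign flips preserve A @ A.T == n * I."""
    rng = np.random.default_rng() if rng is None else rng
    n = 2 ** t
    A = np.array([walsh_row(s, n) for s in range(m)])   # includes the all-ones row
    for _ in range(tries):
        B = rng.choice([-1, 1], size=(m, 1)) * A
        if ((B == 1).any(axis=0) & (B == -1).any(axis=0)).all():
            return B
    raise RuntimeError("no sign assignment satisfying (7) was found")
```

For \(m=6\), \(t=3\) this returns a matrix with the same orthogonality and column properties as (8), though not necessarily the same rows.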

Two types of orthogonal coding matrices are tested in this note. The first type includes no zeros and is generated using a “greedy” algorithm. We choose n to be the smallest multiple of 4 equal to or larger than m and start with an empty matrix. Candidate vectors containing both positive and negative elements are chosen at random to form a row of the matrix and are never repeated. If a candidate vector is orthogonal to the existing rows, it is added to the matrix. New candidates are tested until the matrix is filled or we run out of permutations. A full matrix is almost always returned, especially if \(m<n\). The matrix is then checked to ensure that each column contains both positive and negative elements. Note that the whole process can be repeated as many times as necessary. An eight-class example follows:

$$\begin{aligned} A = \left[ \begin{array}{rrrrrrrr} 1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad -1 \\ 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 \\ 1 &{}\quad -1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad -1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad -1 \\ 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 \end{array} \right] \end{aligned}$$

This type of coding matrix can be solved using the algorithm described in Sect. 2, above.
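A rough sketch of this greedy construction is given below (in Python with numpy); the function name, candidate budget, and restart limit are our own arbitrary choices rather than part of the reference implementation.

```python
import numpy as np

def greedy_orthogonal_matrix(m, restarts=1000, tries_per_restart=20000, rng=None):
    """Randomly search for an m x n coding matrix with entries in {-1, +1},
    mutually orthogonal rows, and mixed-sign columns, where n is the smallest
    multiple of 4 greater than or equal to m."""
    rng = np.random.default_rng() if rng is None else rng
    n = 4 * ((m + 3) // 4)
    for _ in range(restarts):
        rows, seen = [], set()
        for _ in range(tries_per_restart):
            cand = rng.choice([-1, 1], size=n)
            key = tuple(cand)
            if abs(cand.sum()) == n or key in seen:   # need both signs; never repeat
                continue
            seen.add(key)
            if all(np.dot(cand, row) == 0 for row in rows):
                rows.append(cand)
                if len(rows) == m:
                    break
        if len(rows) == m:
            A = np.array(rows)
            # accept only if every column contains both a +1 and a -1
            if ((A == 1).any(axis=0) & (A == -1).any(axis=0)).all():
                return A
    raise RuntimeError("failed to build a full coding matrix")
```

For modest numbers of classes this typically succeeds within a few restarts; as noted below, more efficient constructions would be needed for large m.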

Table 1 Parameters chosen for the second type of orthogonal coding matrix: for each number of classes, m, the initial length of the code, \(n_0\), and the number of non-zero values in each code, \(|\mathbf {a}_i|\) (\(i=1,\ldots ,m\)), are given

The other type of orthogonal coding matrix to be tested in this note includes zeros. The construction is similar, except that the matrix is now allowed to take on values of zero while the number of non-zero values (\(-1\) or \(+1\)) in each row is kept fixed. An initial size is chosen for the matrix, typically larger than the number of classes; the resulting matrix will normally be somewhat smaller, since degenerate and fixed-value columns (those for which a correctly-trained binary classifier would always return the same value) are removed. The parameters chosen for each class size are shown in Table 1.

Coding matrices of this type were generated by pure brute force, with no attempt to track previous trials. An example coding matrix for six classes is shown below; redundant columns are shown in bold.

$$\begin{aligned} A = \left[ \begin{array}{rrrrrrrrrrrr} -1 &{}\quad 0 &{}\quad -1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad \mathbf{-\,1} &{}\quad \mathbf{0} &{}\quad \mathbf{0} &{}\quad \mathbf{-\,1} \\ 0 &{}\quad 1 &{}\quad -1 &{}\quad 0 &{}\quad -1 &{}\quad -1 &{}\quad 0 &{}\quad -1 &{}\quad \mathbf{-\,1} &{}\quad \mathbf{0} &{}\quad \mathbf{0} &{}\quad \mathbf{0} \\ -1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad -1 &{}\quad -1 &{}\quad \mathbf{0} &{}\quad \mathbf {-\,1} &{}\quad \mathbf {0} &{}\quad \mathbf{-\,1} \\ 1 &{}\quad 0 &{}\quad 0 &{}\quad -1 &{}\quad 0 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad \mathbf{0} &{}\quad \mathbf{-\,1} &{}\quad \mathbf{0} &{}\quad \mathbf{0} \\ 0 &{}\quad -1 &{}\quad -1 &{}\quad 0 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 0 &{}\quad \mathbf {0} &{}\quad \mathbf{0} &{}\quad \mathbf{1} &{}\quad \mathbf{0} \\ 0 &{}\quad -1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad -1 &{}\quad 1 &{}\quad 0 &{}\quad \mathbf{0} &{}\quad \mathbf{-\,1} &{}\quad \mathbf{1} &{}\quad {\mathbf{0}} \end{array} \right] \end{aligned}$$

This type of orthogonal ECC is solved using a general, iterative, constrained, linear least-squares solver [25].
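As a rough, substitute illustration (not the solver used in this work), the constrained problem in (4)-(6) for a coding matrix containing zeros can also be posed directly to a general-purpose optimizer, for instance scipy’s SLSQP method:

```python
import numpy as np
from scipy.optimize import minimize

def solve_probabilities_general(A, r):
    """Minimize |A^T p - r| subject to sum(p) = 1 and p >= 0 (Eqs. (4)-(6)),
    for a coding matrix A that may contain zeros."""
    m = A.shape[0]
    result = minimize(
        lambda p: np.sum((A.T @ p - r) ** 2),       # objective (4), squared
        x0=np.full(m, 1.0 / m),                     # start from uniform probabilities
        method="SLSQP",
        bounds=[(0.0, 1.0)] * m,                    # non-negativity, (6)
        constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],  # normalization, (5)
    )
    return result.x
```

A dedicated constrained linear least-squares routine, as in [25], will generally be faster than a general-purpose optimizer, but the problem being solved is the same.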

More work will need to be done to find efficient methods of generating these matrices if they are to be applied to problems with a large number of classes.

4 Results

Table 2 Total classification time, solution time, uncertainty coefficient and Brier score for seven different datasets using five different coding matrices: 1 versus 1, 1 versus the rest, random, orthogonal with no zeros, and orthogonal with zeros
Table 3 Total classification time, solution time, uncertainty coefficient and Brier score for seven different datasets using five different coding matrices: 1 versus 1, 1 versus the rest, random, orthogonal with no zeros, and orthogonal with zeros
Table 4 Solution time, uncertainty coefficient and Brier score for seven different datasets using five different coding matrices: 1 versus 1, 1 versus the rest, random, orthogonal with no zeros, and orthogonal with zeros

Orthogonal error-correcting codes were tested on seven different datasets: two for digit recognition–“pendigits” [26] and “usps” [27]; the space shuttle control dataset–“shuttle” [28]; an urban land classification dataset–“urban” [29]; a similar one for satellite land classification–“sat”; a dataset for patterned image recognition–“segment”; and a dataset for vehicle recognition–“vehicle” [30]. The last three are borrowed from the “statlog” project [1, 28].

Two types of orthogonal ECCs were tested: the first type described in Sect. 3, with no zeros in the codes, and the second type, which includes zeros. These were compared with three other methods: one-versus-one, one-versus-the-rest, and random ECCs with the same length of coding vector (number of columns), n, as the orthogonal matrices of the first type. The 1-versus-the-rest problems as well as the random ECCs were solved using the same type of constrained linear least-squares method as used for the second type of orthogonal ECC [25]. By enforcing the normalization constraint using a Lagrange multiplier, 1 versus 1 may be solved with a simple (unconstrained) linear equation solver [31].

Three types of binary classifier were used: logistic regression [1], support vector machines [4], and a piecewise-linear classifier [3]. Logistic regression classifiers were trained using LIBLINEAR [32].

Support vector machines (SVMs) were trained using LIBSVM [33]. Partitions were trained separately then combined by finding the union of sets of support vectors for each partition. By indexing into the combined list of support vectors, the algorithms are optimized in both space and time [33]. For SVM, the same parameters were used for all multi-class methods and for all partitions (matrix columns). All datasets were trained using “radial basis function” (Gaussian) kernels of differing widths.

LIBSVM was also used to train an intermediate model from which an often faster piecewise-linear classifier [3] was trained. It was thought that this classifier would provide a better use case for orthogonal ECCs than either of the other two. The single parameter for this algorithm, the number of border vectors, was set for each dataset to the same value used in [3] for 1 versus 1. For the other multi-class algorithms, the number of border vectors was doubled for small values (under 100) and increased by fifty percent for larger values to account for the more complex decision function created by using more classes in each binary classifier. Multi-class classifiers were designed, trained and applied using the framework provided within libAGF [3, 34, 35].

Results are shown in Tables 2, 3, and 4. Confidence limits represent standard deviations over 10 trials using different, randomly chosen coding matrices. For each trial, datasets were randomly separated into 70% training and 30% test. “U.C.” stands for uncertainty coefficient, a skill score based on Shannon’s channel capacity that has many advantages over the simple fraction of correct guesses or “accuracy” [34, 36, 37]. Probabilities are validated with the Brier score, which is the root-mean-square error measured against the truth of the class expressed as a 0 or 1 value [16, 38].

For all of the datasets tested, orthogonal ECCs provide a small but significant improvement over random ECCs in both classification accuracy and in the accuracy of the conditional probabilities. This is in line with the literature [9, 14]. Improvements range from 0.4% to 17.5% relative (0.004 to 0.139 absolute) in uncertainty coefficient and 0.7% to 10.7% in Brier score. Results are also more consistent for the orthogonal ECCs, as indicated by the calculated error bars.

Also as expected, solution times are extremely fast for the first type of orthogonal ECC. In many cases the times are an order of magnitude better than the next fastest method. Depending on the problem and classification method, this may or may not be significant. Since SVM is a relatively slow classifier, solution times are a minor portion of the total. For the logistic regression classifier, solving the constrained optimization problem for the probabilities typically comprises the bulk of classification time. Oddly, the solver for the 1 versus 1 method is the slowest by a wide margin, even though it is a simple (unconstrained) linear solver [31]. This could potentially be improved by using a faster solver [37] or by employing the iterative method given in [31].

The two types of orthogonal ECCs were quite close in accuracy, with sometimes one taking the lead and sometimes the other. For the linear classifier, the second type was always more accurate while the first type was faster. Since the second type admits zeros, its decision boundaries are usually simpler; see below. For both the SVM and the piecewise linear classifier, skill scores were very similar, differing by at most 2.9% relative, 0.018 absolute, in U.C. and 17% in Brier score. For the SVM, the second type was faster, while for the piecewise linear classifier, the first type was faster. The explanation for this follows.

Unfortunately, there is one method that is consistently more accurate than the orthogonal ECCs, and that is 1 versus 1. The orthogonal ECCs only beat 1 versus 1 three times out of 21 for the uncertainty coefficient and once out of 21 for the Brier score. Improvements in uncertainty coefficient range from insignificant to 0.6% relative or 0.004 absolute. The Brier score improved by 2.6%. Losses using linear classifiers were the worst, peaking at 14.6% relative, 0.203 absolute, in uncertainty coefficient and 50% in Brier score. The results for logistic regression provide a vivid demonstration of why 1 versus 1 works so well: because it partitions the classes into “least-divisible units”, there are fewer training samples provided to each binary classifier, the decision boundary is simpler, and a simpler classifier will work better.

Nonetheless, there is a potential use case for our method. Although orthogonal ECCs are less accurate than 1 versus 1, they don’t lose much. If they are also faster, then a speed improvement may be worth a small hit in accuracy for some applications [3]. While 1 versus 1 beats orthogonal ECCs by a healthy margin using linear classifiers, the biggest loss in U.C. for SVM is only 1.5% relative, 0.011 absolute. Losses for Brier score are somewhat worse, peaking at 6.5%. Unfortunately, because the speed of a multi-class SVM is proportional mainly to the total number of support vectors [3], orthogonal ECCs rarely provide much of a speed advantage. What is needed is a constant-time–ideally very fast–non-linear classifier. This is where the piecewise-linear classifier comes in.

For uncertainty coefficient, 1 versus 1 was always better than orthogonal ECCs when using the piecewise-linear classifier. Losses peak at 1.9% relative, 0.017 absolute. For the Brier score, only one of the seven datasets showed an improvement over 1 versus 1, at 4.9%. The worst loss was 39%. Improvements in speed range from 1.1% to over 100%. Much of the speed difference is simply the result of using fewer binary classifiers.

The purpose of the piecewise linear classifier is to improve the speed of the SVM. This speed increase is better with orthogonal ECCs than with 1 versus 1. Orthogonal ECCs applied to piecewise linear classifiers are faster than the fastest SVM for five out of the seven datasets. Speed often trades off against accuracy. [3] provides a procedure for determining whether it is worth switching algorithms. A similar analysis will not be repeated here due to time and space considerations; however, whether any improvement in speed is worth the consequent hit in accuracy will depend on the application.

5 Conclusions

As predicted by the recent literature, solving the multi-class problem using orthogonal ECCs was more accurate than using random ECCs. Unfortunately, orthogonal ECCs were still unable to beat one-versus-one as an effective multi-class method. The author’s own work suggests that 1 versus 1 classification almost always works well regardless of the dataset [35]. Hsu and Lin [8] find that 1 versus 1 outperforms both 1 versus the rest and random ECCs on a test of ten different datasets using SVM. One-versus-one is also used, often exclusively, in many statistical classification software packages.

There may still be room for further work, however, with the most likely fruitful line of inquiry being adaptive methods that use the data to determine how best to go from binary to multi-class. In [35], for instance, even though 1 versus 1 was almost always the most accurate, there was one dataset that benefitted from a more customized treatment. Recent work has focused on both empirically-designed decision trees [5, 6, 39] as well as empirically-designed ECCs [10,11,12]. Decision trees are the easiest to tackle because there are fewer possibilities and because a tree can be built from either the top down or the bottom up.

A second potential area for future work is in multi-class methods integrated with the base binary classifier, for instance with all the binary classifiers being trained simultaneously [8]. It stands to reason that more integrated multi-class methods would tend to be more accurate than those, such as the ones discussed in this note, that treat the binary classifier as a “black box”, since there can now be sharing of information.

There is also a potential use case for orthogonal ECCs. If they are paired with a fast, non-linear binary classifier with better than O(N) performance, where N is the number of training samples, orthogonal ECCs should almost always be faster than 1 versus 1 while giving up little in accuracy. The algorithm presented here that solves for the probabilities is simple and elegant and may suggest new directions in the search for more efficient and accurate multi-class classification algorithms. Since it is fast it could help provide speed improvements for such applications as real-time computer vision, image processing, and voice-recognition.