1 Introduction

Many methods of statistical classification can only discriminate between two classes. Examples include linear classifiers such as perceptrons and logistic regression [1], piecewise linear classifiers [2, 3], as well as support vector machines [4]. There are many ways of generalizing binary classification to multi-class, and the number of possibilities increases exponentially with the number of classes.

One should distinguish between multi-class methods that use only a subset of the binary classifiers, adding more as the algorithm narrows down the class, and those that use all of the binary classifiers, combining the results or solving for the class probabilities. In the former category, we have hierarchical multi-class classifiers such as decision trees [5, 6] and decision directed acyclic graphs (DDAGs) [7]. In the latter category, two common methods are one-versus-one (1 vs. 1) and one-versus-the-rest (1 vs. rest) [8]. These in turn generalize to error-correcting codes (ECCs) [9].

Early experiments with ECCs used random codes: the assumption is that if the codes are long enough (there are enough binary classifiers) they will adequately span the classes. Later work focused on optimizing the design of the codes: what type of codes will best span the classes and produce the most accurate results? Here we can also distinguish between two types: those that use the data to help design the codes [10,11,12] and those that are independent of the data but use the mathematical properties of the codes themselves to aid in their construction [13,14,15]. It is this latter type of optimized error-correcting code that we turn to in this note.

In error-correcting coding, there is a coding matrix, A, that specifies how the set of multiple classes is partitioned for each binary classifier. For a given column, if members of the jth class are to be labeled \(-1\) or \(+1\) by the binary classifier, then the jth row of that column is assigned a \(-1\) or \(+1\), respectively. If the jth class is left out, then the jth row is assigned a 0. Typically, the class of the test point is determined by the distance between a row in the matrix and a vector of binary decision functions:

$$\begin{aligned} c(\mathbf {x}) = \arg \min _i | \mathbf {a}_i - \mathbf {r}(\mathbf {x}) | \end{aligned}$$
(1)

where \(\mathbf {a}_i\in \lbrace -1,0,+1 \rbrace\) is the ith row of the coding matrix and \(\mathbf {r}\) is the vector of decision functions at the test point, \(\mathbf {x}\). If we take the upright brackets to denote Euclidean distance, we can expand (1) as follows:

$$\begin{aligned} c = \arg \min _i \left( | \mathbf {a}_i |^2 + | \mathbf {r} |^2 - 2\, \mathbf {a}_i \cdot \mathbf {r} \right) \end{aligned}$$

Since \(| \mathbf {r} |^2\) is constant over i, it may be removed from the expression. Also, for the purposes of this note, each row of A will be given the same number of non-zero entries, hence:

$$\begin{aligned} | \mathbf {a}_i | = | \mathbf {a}_j | = const. \end{aligned}$$

This is most evident in the case in which each binary classifier partitions all of the classes, so that there are no zeros in A, as with one-versus-the-rest partitioning. Then (1) reduces to a voting solution:

$$\begin{aligned} c = \arg \max _i \left( A \mathbf {r} \right) _i \end{aligned}$$
(2)

Both [13] and [14] show that to maximize the accuracy of an ECC, the distance between each pair of rows, \(|\mathbf {a}_i - \mathbf {a}_j|_{i \ne j}\), should be maximized. Using the above assumptions, this reduces to:

$$\begin{aligned} \min |\mathbf {a}_i \cdot \mathbf {a}_j|_{i \ne j} \end{aligned}$$

Note the absolute value prevents degenerate rows. In other words, the coding matrix, A, should be orthogonal.
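To make the decoding concrete, the following minimal sketch (in Python with numpy; the function names and toy values are our own illustration, not taken from any reference implementation) computes the class both by the distance rule in (1) and by the voting rule in (2):

```python
import numpy as np

def classify_distance(A, r):
    """Return the class whose code row is nearest (in Euclidean distance)
    to the vector of binary decision functions r -- Eq. (1)."""
    return int(np.argmin(np.linalg.norm(A - r, axis=1)))

def classify_voting(A, r):
    """Voting form, Eq. (2): valid when every row of A has the same
    number of non-zero entries (for instance, no zeros at all)."""
    return int(np.argmax(A @ r))

# toy example: one-versus-the-rest coding for three classes
A = np.array([[ 1, -1, -1],
              [-1,  1, -1],
              [-1, -1,  1]])
r = np.array([-0.2, 0.7, -0.5])          # hypothetical decision functions
print(classify_distance(A, r), classify_voting(A, r))   # both give class 1
```

With constant row norms the two rules agree, which is the point of the reduction above.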

In this note, we describe a fast and simple algorithm that uses orthogonal ECCs to solve for the conditional probabilities in multi-class classification. There are three reasons to require the conditional probabilities:

  1. Probabilities provide useful extra information, specifically how accurate a given classification is in the absence of knowledge of its true value.

  2. The relationship between the binary probabilities and the multi-class probabilities derives uniquely and rigorously from probability theory.

  3. Binary classifiers that do not return calibrated probability estimates, but nonetheless supply a continuous decision function, are easy to recalibrate so that the decision function more closely resembles a probability [16, 17].

Two types of orthogonal ECCs, along with three other multi-class methods (1 versus 1, 1 versus the rest, and random ECCs), will be tested on seven different datasets using three different binary classifiers (logistic regression, support vector machines (SVM), and piecewise linear) to see how they compare in terms of classification speed, classification accuracy, and accuracy of the conditional probabilities.

2 Algorithm

We wish to design a set of n binary classifiers, each of which returns a decision function:

$$\begin{aligned} r_j(\mathbf {x}) = P_j(+1 | \mathbf {x}) - P_j(-1 | \mathbf {x}) \end{aligned}$$

where \(P_j(c | \mathbf {x})\) is the conditional probability of the cth class of the jth classifier. Each binary classifier partitions a set of m classes such that for a given test point, \(\mathbf {x}\):

$$\begin{aligned} \sum _{i=1}^m a_{ij} p_i = r_j;\quad j=[1,\ldots ,n] \end{aligned}$$

where \(A=\lbrace a_{ij} \in \lbrace -1, +1 \rbrace \rbrace\) is a coding matrix for which each code partitions all of the classes and \(p_i = p(i | \mathbf {x})\) is the conditional probability of the ith class. In vector notation:

$$\begin{aligned} A^T \mathbf {p} = \mathbf {r} \end{aligned}$$
(3)

This result derives from the fact that the class probabilities are additive [18]. The more general case, in which a class can be excluded, that is, the coding may include zeros, \(a_{ij} \in \lbrace -1, 0, +1\rbrace\), will be treated in the next section.
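As a concrete check, consider one-versus-the-rest coding with \(m=3\) and the column \(\mathbf {a}_{\cdot j} = (+1, -1, -1)^T\). Since the class probabilities are additive,

$$\begin{aligned} \sum _{i=1}^3 a_{ij} p_i = p_1 - p_2 - p_3 = P_j(+1 | \mathbf {x}) - P_j(-1 | \mathbf {x}) = r_j \end{aligned}$$

which is the jth component of (3).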

Note that this assumes that the binary decision functions, \(\mathbf {r}\), estimate the conditional probabilities perfectly. In practice there is a set of constraints that must be enforced because \(\mathbf {p}\) is only allowed to take on certain values. Thus, we wish to solve the following minimization problem:

$$\begin{aligned}&\arg \min _{\mathbf {p}} | A^T \mathbf {p} - \mathbf {r} | \end{aligned}$$
(4)
$$\begin{aligned}&\sum _{i=1}^m p_i = 1 \end{aligned}$$
(5)
$$\begin{aligned}&p_i \ge 0; \quad i=[1,\ldots ,m] \end{aligned}$$
(6)

If A is orthogonal,

$$\begin{aligned} A A^T = n I \end{aligned}$$

where I is the \(m \times m\) identity matrix, then the unconstrained minimization problem is easy to solve. Note that the voting solution in (2) is now equivalent to the inverse solution in (3). This allows us to determine the class easily, but we also wish to solve for the probabilities, \(\mathbf {p}\), so that none of the constraints in (5) or (6) are violated.

The orthogonality property allows us to reduce the minimization problem in (4) to something much simpler:

$$\begin{aligned} \arg \min _{\mathbf {p}} | \mathbf {p} - \mathbf {p}_0 | \end{aligned}$$

where \(\mathbf {p}_0 = A \mathbf {r}/n\), with the constraints in (5) and (6) remaining the same. Because the system has been rotated and expanded, the non-negativity constraints in (6) remain orthogonal, meaning they are independent: enforcing one by setting one of the probabilities to zero, \(p_k=0\) for example, should not otherwise affect the solution. This still leaves the normalization constraint in (5): the problem, now strictly geometrical, consists of finding the point nearest \(\mathbf {p}_0\) on the diagonal hyper-surface that bisects the unit hyper-cube.

Briefly, we can summarize the algorithm as follows: (1) move to the nearest point that satisfies the normalization constraint, (5); (2) if one or more of the probabilities is negative, move to the nearest point that satisfies both the normalization constraint and the non-negativity constraints, (6), for the negative probabilities; (3) repeat step 2 until no probabilities are negative. More formally, let \(\mathbf {1}\) be a vector of all 1’s:

  • \(i:=0\); \(m_0:=m\)

  • while \(\exists k \, p_{ik} < 0 \vee \mathbf {p}_i \cdot \mathbf {1} \ne 1\):

    • if \(\mathbf {p}_i \cdot \mathbf {1} \ne 1\) then \(\mathbf {p}_{i+1} := \mathbf {p}_i - \left[ (\mathbf {p}_i \cdot \mathbf {1} - 1)/m_i \right] \mathbf {1}\); otherwise \(\mathbf {p}_{i+1} := \mathbf {p}_i\)

    • let K be the set of k such that \(p_{i+1,k} < 0\)

    • for each \(k \in K\):

      • \(p_{i+1,k}:=0\)

      • Remove k from the problem

    • \(m_{i+1}:=m_i-|K|\)

    • \(i:=i+1\)

Note that the resultant direction vectors for each step form an orthogonal set. For instance, suppose \(m_0=4\) and, after enforcing the normalization constraint, the first probability is less than zero, \(p_{1,1} < 0\); then the unit direction vectors for the two motions satisfy:

$$\begin{aligned} \frac{1}{2}[1, 1, 1, 1] \cdot \frac{1}{2\sqrt{3}} [-3, 1, 1, 1] = 0 \end{aligned}$$

More generally, consider the following sequence of vectors:

$$\begin{aligned} v_{ij} = \frac{1}{\sqrt{(m-i)(m-i+1)}} \left\{ \begin{array}{ll} 0; &{}\quad j < i \\ -(m-i); &{}\quad j=i \\ 1; &{}\quad j > i \end{array} \right. \end{aligned}$$

where \(i \in [1, m-1]\) and \(j \in [1, m]\) [19]. A nice feature of this method, in addition to being fast, is that it is divided into two stages: a solution stage and a normalization stage.
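A minimal sketch of this procedure in Python with numpy (our own rendering, not the libAGF implementation) starts from the unconstrained solution \(\mathbf {p}_0 = A \mathbf {r}/n\) and alternately re-normalizes and zeros out negative probabilities:

```python
import numpy as np

def solve_probabilities(A, r):
    """Estimate multi-class probabilities from binary decision functions r,
    given an orthogonal coding matrix A with no zeros, i.e. A @ A.T == n * I."""
    m, n = A.shape
    p = A @ r / n                       # unconstrained solution, p0 = A r / n
    active = np.ones(m, dtype=bool)     # classes still in the problem
    while True:
        k = active.sum()
        p[active] -= (p[active].sum() - 1.0) / k   # nearest point with sum(p) = 1
        neg = active & (p < 0)
        if not neg.any():
            break
        p[neg] = 0.0                    # enforce non-negativity ...
        active &= ~neg                  # ... and remove those classes
    return p
```

The class estimate is then simply the index of the largest returned probability, consistent with the voting solution in (2).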

3 Constructing the coding matrix

Finding an A such that \(A A^T = n I\) and \(a_{ij} \in \lbrace -1, +1 \rbrace\) is quite a difficult combinatorial problem. When zeros are allowed, \(a_{ij}\in \lbrace -1, 0, +1\rbrace\), it becomes even more difficult. Work in signal processing may be of limited applicability because coding matrices there are typically composed of 0’s and 1’s rather than \(-1\)’s and \(+1\)’s [20, 21]. In our case, a further restriction is that each column must contain both positive and negative elements, or:

$$\begin{aligned} \left| \sum _{i=1}^m a_{ij} \right| \ne \sum _{i=1}^m |a_{ij}|;\quad j=[1\ldots n] \end{aligned}$$
(7)

A simple method of designing an orthogonal A is to use harmonic series. Consider the following matrix for six classes (\(m=6\)) and eight binary classifiers (\(n=8\)):

$$\begin{aligned} A = \left[ \begin{array}{rrrrrrrr} 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad -1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad -1 \end{array} \right] \end{aligned}$$
(8)

This will limit the size of m relative to n; more precisely: \(m \le \lfloor 2 \log _2 n \rfloor\). Moreover, only certain values of n will be admitted: \(n=2^t\) where t is a whole number.

Rows two through four in (8) comprise a Walsh-Hadamard code [22]: reading down their columns, all possible sign combinations are listed. A square (\(n=m\)) orthogonal coding matrix is called a Hadamard matrix [23]. It can be shown that, besides \(n=1\) and \(n=2\), Hadamard matrices can exist only for sizes \(n=4t\), and it remains unproven whether examples exist for every value of t [24]. A very simple, recursive method exists to generate matrices of size \(n=2^t\) [24], but these cannot be made to have the property in (7) since the matrix includes both a row and a column of only ones. Such a matrix will include a “harmonic series” of the same type as in (8).
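One way to realize a Walsh-style construction in code is sketched below (in Python with numpy). This is our own interpretation rather than the method actually used to produce (8): it draws rows from the full set of Walsh functions (ignoring the \(m \le \lfloor 2 \log _2 n \rfloor\) restriction of the particular harmonic-series construction above) and flips row signs at random until condition (7) holds.

```python
import numpy as np

def walsh_row(s, n):
    """Walsh function for bit mask s at positions j = 0..n-1: the entry is
    (-1)**popcount(j & s).  Distinct masks give mutually orthogonal rows."""
    return np.array([(-1) ** bin(j & s).count("1") for j in range(n)])

def walsh_coding_matrix(m, t, tries=1000, rng=None):
    """Sketch: an m x n orthogonal coding matrix (n = 2**t, m <= n) built from
    Walsh rows, with row signs flipped so that every column contains both a
    +1 and a -1 (condition (7)).  Row sign flips preserve A @ A.T == n * I."""
    rng = np.random.default_rng() if rng is None else rng
    n = 2 ** t
    A = np.array([walsh_row(s, n) for s in range(m)])   # includes the all-ones row
    for _ in range(tries):
        B = rng.choice([-1, 1], size=(m, 1)) * A
        if ((B == 1).any(axis=0) & (B == -1).any(axis=0)).all():
            return B
    raise RuntimeError("no sign assignment satisfying (7) was found")
```

For \(m=6\), \(t=3\) this returns a matrix with the same orthogonality and column properties as (8), though not necessarily the same rows.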

Two types of orthogonal coding matrices are tested in this note. The first type includes no zeros and is generated using a “greedy” algorithm. We choose n to be the smallest multiple of 4 equal to or larger than m and start with an empty matrix. Candidate vectors containing both positive and negative elements are chosen at random to form a row of the matrix and are never repeated. If a candidate vector is orthogonal to the existing rows, it is added to the matrix. New candidates are tested until the matrix is filled or we run out of permutations. A full matrix is almost always returned, especially if \(m<n\). The matrix is then checked to ensure that each column contains both positive and negative elements. Note that the whole process can be repeated as many times as necessary. An eight-class example follows:

$$\begin{aligned} A = \left[ \begin{array}{rrrrrrrr} 1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad -1 \\ 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 \\ 1 &{}\quad -1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad -1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad -1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ -1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad -1 \\ 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad -1 &{}\quad 1 \end{array} \right] \end{aligned}$$

This type of coding matrix can be solved using the algorithm described in Sect. 2, above.
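A rough sketch of this greedy construction is given below (in Python with numpy); the function name, candidate budget, and restart limit are our own arbitrary choices rather than part of the reference implementation.

```python
import numpy as np

def greedy_orthogonal_matrix(m, restarts=1000, tries_per_restart=20000, rng=None):
    """Randomly search for an m x n coding matrix with entries in {-1, +1},
    mutually orthogonal rows, and mixed-sign columns, where n is the smallest
    multiple of 4 greater than or equal to m."""
    rng = np.random.default_rng() if rng is None else rng
    n = 4 * ((m + 3) // 4)
    for _ in range(restarts):
        rows, seen = [], set()
        for _ in range(tries_per_restart):
            cand = rng.choice([-1, 1], size=n)
            key = tuple(cand)
            if abs(cand.sum()) == n or key in seen:   # need both signs; never repeat
                continue
            seen.add(key)
            if all(np.dot(cand, row) == 0 for row in rows):
                rows.append(cand)
                if len(rows) == m:
                    break
        if len(rows) == m:
            A = np.array(rows)
            # accept only if every column contains both a +1 and a -1
            if ((A == 1).any(axis=0) & (A == -1).any(axis=0)).all():
                return A
    raise RuntimeError("failed to build a full coding matrix")
```

For modest numbers of classes this typically succeeds within a few restarts; as noted below, more efficient constructions would be needed for large m.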

Table 1 Parameters chosen for the second type of orthogonal coding matrix: for each number of classes, m, the initial length of the code, \(n_0\), and the number of non-zero values in each code, \(|\mathbf {a}_i|\) (\(i=1,\ldots ,m\)), are given

The other type of orthogonal coding matrix to be tested in this note includes zeros. The construction is similar, except that the matrix is now allowed to take on values of zero while the number of non-zero values (\(-1\) or \(+1\)) in each row is kept fixed. An initial size is chosen for the matrix, typically larger than the number of classes; the resulting matrix will normally be somewhat smaller, since degenerate and fixed-value columns (those for which a correctly-trained binary classifier would always return the same value) are removed. The parameters chosen for each class size are shown in Table 1.

Coding matrices of this type were generated by pure brute force, with no attempt to track previous trials. An example coding matrix for six classes is shown below; redundant columns are shown in bold.

$$\begin{aligned} A = \left[ \begin{array}{rrrrrrrrrrrr} -1 &{}\quad 0 &{}\quad -1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad \mathbf{-\,1} &{}\quad \mathbf{0} &{}\quad \mathbf{0} &{}\quad \mathbf{-\,1} \\ 0 &{}\quad 1 &{}\quad -1 &{}\quad 0 &{}\quad -1 &{}\quad -1 &{}\quad 0 &{}\quad -1 &{}\quad \mathbf{-\,1} &{}\quad \mathbf{0} &{}\quad \mathbf{0} &{}\quad \mathbf{0} \\ -1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad -1 &{}\quad -1 &{}\quad \mathbf{0} &{}\quad \mathbf {-\,1} &{}\quad \mathbf {0} &{}\quad \mathbf{-\,1} \\ 1 &{}\quad 0 &{}\quad 0 &{}\quad -1 &{}\quad 0 &{}\quad -1 &{}\quad -1 &{}\quad 1 &{}\quad \mathbf{0} &{}\quad \mathbf{-\,1} &{}\quad \mathbf{0} &{}\quad \mathbf{0} \\ 0 &{}\quad -1 &{}\quad -1 &{}\quad 0 &{}\quad -1 &{}\quad 1 &{}\quad -1 &{}\quad 0 &{}\quad \mathbf {0} &{}\quad \mathbf{0} &{}\quad \mathbf{1} &{}\quad \mathbf{0} \\ 0 &{}\quad -1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad -1 &{}\quad 1 &{}\quad 0 &{}\quad \mathbf{0} &{}\quad \mathbf{-\,1} &{}\quad \mathbf{1} &{}\quad {\mathbf{0}} \end{array} \right] \end{aligned}$$

This type of orthogonal ECC is solved using a general, iterative, constrained, linear least-squares solver [25].
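As a rough, substitute illustration (not the solver used in this work), the constrained problem in (4)-(6) for a coding matrix containing zeros can also be posed directly to a general-purpose optimizer, for instance scipy’s SLSQP method:

```python
import numpy as np
from scipy.optimize import minimize

def solve_probabilities_general(A, r):
    """Minimize |A^T p - r| subject to sum(p) = 1 and p >= 0 (Eqs. (4)-(6)),
    for a coding matrix A that may contain zeros."""
    m = A.shape[0]
    result = minimize(
        lambda p: np.sum((A.T @ p - r) ** 2),       # objective (4), squared
        x0=np.full(m, 1.0 / m),                     # start from uniform probabilities
        method="SLSQP",
        bounds=[(0.0, 1.0)] * m,                    # non-negativity, (6)
        constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],  # normalization, (5)
    )
    return result.x
```

A dedicated constrained linear least-squares routine, as in [25], will generally be faster than a general-purpose optimizer, but the problem being solved is the same.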

More work will need to be done to find efficient methods of generating these matrices if they are to be applied to problems with a large number of classes.

4 Results

Table 2 Total classification time, solution time, uncertainty coefficient and Brier score for seven different datasets using five different coding matrices: 1 versus 1, 1 versus the rest, random, orthogonal with no zeros, and orthogonal with zeros
Table 3 Total classification time, solution time, uncertainty coefficient and Brier score for seven different datasets using five different coding matrices: 1 versus 1, 1 versus the rest, random, orthogonal with no zeros, and orthogonal with zeros
Table 4 Solution time, uncertainty coefficient and Brier score for seven different datasets using five different coding matrices: 1 versus 1, 1 versus the rest, random, orthogonal with no zeros, and orthogonal with zeros

Orthogonal error-correcting codes were tested on seven different datasets: two for digit recognition–“pendigits” [26] and “usps” [27]; the space shuttle control dataset–“shuttle” [28]; an urban land classification dataset–“urban” [29]; a similar one for satellite land classification–“sat”; a dataset for patterned image recognition–“segment”; and a dataset for vehicle recognition–“vehicle” [30]. The last three are borrowed from the “statlog” project [1, 28].

Two types of orthogonal ECCs were tested: the first type described in Sect. 3, with no zeros in the codes, and the second type, which includes zeros. These were compared with three other methods: one-versus-one, one-versus-the-rest, and random ECCs with the same length of coding vector (number of columns), n, as the orthogonal matrices of the first type. The 1-versus-the-rest problems as well as the random ECCs were solved using the same type of constrained linear least-squares method as used for the second type of orthogonal ECC [25]. By enforcing the normalization constraint using a Lagrange multiplier, 1 versus 1 may be solved with a simple (unconstrained) linear equation solver [31].

Three types of binary classifier were used: logistic regression [1], support vector machines [4], and a piecewise-linear classifier [3]. Logistic regression classifiers were trained using LIBLINEAR [32].

Support vector machines (SVMs) were trained using LIBSVM [33]. Partitions were trained separately then combined by finding the union of sets of support vectors for each partition. By indexing into the combined list of support vectors, the algorithms are optimized in both space and time [33]. For SVM, the same parameters were used for all multi-class methods and for all partitions (matrix columns). All datasets were trained using “radial basis function” (Gaussian) kernels of differing widths.

LIBSVM was also used to train an intermediate model from which an often faster piecewise-linear classifier [3] was trained. It was thought that this classifier would provide a better use case for orthogonal ECCs than either of the other two. The single parameter for this algorithm, the number of border vectors, was set for each dataset to the same value used in [3] for 1 versus 1. For the other multi-class algorithms, the number of border vectors was doubled for small values (under 100) and increased by fifty percent for larger values to account for the more complex decision function created by using more classes in each binary classifier. Multi-class classifiers were designed, trained and applied using the framework provided within libAGF [3, 34, 35].

Results are shown in Tables 2, 3, and 4. Confidence limits represent standard deviations over 10 trials using different, randomly chosen coding matrices. For each trial, datasets were randomly separated into 70% training and 30% test. “U.C.” stands for uncertainty coefficient, a skill score based on Shannon’s channel capacity that has many advantages over the simple fraction of correct guesses or “accuracy” [34, 36, 37]. Probabilities are validated with the Brier score, which is the root-mean-square error measured against the truth of the class expressed as a 0 or 1 value [16, 38].

For all of the datasets tested, orthogonal ECCs provide a small but significant improvement over random ECCs in both classification accuracy and in the accuracy of the conditional probabilities. This is in line with the literature [9, 14]. Improvements range from 0.4% to 17.5% relative (0.004 to 0.139 absolute) in uncertainty coefficient and 0.7% to 10.7% in Brier score. Results are also more consistent for the orthogonal ECCs, as indicated by the calculated error bars.

Also as expected, solution times are extremely fast for the first type of orthogonal ECC. In many cases the times are an order of magnitude better than the next fastest method. Depending on the problem and classification method, this may or may not be significant. Since SVM is a relatively slow classifier, solution times are a minor portion of the total. For the logistic regression classifier, solving the constrained optimization problem for the probabilities typically comprises the bulk of classification time. Oddly, the solver for the 1 versus 1 method is the slowest by a wide margin, even though it is a simple (unconstrained) linear solver [31]. This could potentially be improved by using a faster solver [37] or by employing the iterative method given in [31].

The two types of orthogonal ECCs were quite close in accuracy, with sometimes one taking the lead and sometimes the other. For the linear classifier, the second type was always more accurate while the first type was faster. Since the second type admits zeros, its decision boundaries are usually simpler; see below. For both the SVM and the piecewise linear classifier, skill scores were very similar, differing by at most 2.9% relative, 0.018 absolute, in U.C. and 17% in Brier score. For the SVM, the second type was faster, while for the piecewise linear classifier, the first type was faster. The explanation for this follows.

Unfortunately, there is one method that is consistently more accurate than the orthogonal ECCs, and that is 1 versus 1. The orthogonal ECCs only beat 1 versus 1 three times out of 21 for the uncertainty coefficient and once out of 21 for the Brier score. Improvements in uncertainty coefficient range from insignificant to 0.6% relative or 0.004 absolute. The Brier score improved by 2.6%. Losses using linear classifiers were the worst, peaking at 14.6% relative, 0.203 absolute, in uncertainty coefficient and 50% in Brier score. The results for logistic regression provide a vivid demonstration of why 1 versus 1 works so well: because it partitions the classes into “least-divisible units”, there are fewer training samples provided to each binary classifier, the decision boundary is simpler, and a simpler classifier will work better.

Nonetheless, there is a potential use case for our method. Although orthogonal ECCs are less accurate than 1 versus 1, they don’t lose much. If they are also faster, then a speed improvement may be worth a small hit in accuracy for some applications [3]. While 1 versus 1 beats orthogonal ECCs by a healthy margin using linear classifiers, the biggest loss in U.C. for SVM is only 1.5% relative, 0.011 absolute. Losses for Brier score are somewhat worse, peaking at 6.5%. Unfortunately, because the speed of a multi-class SVM is proportional mainly to the total number of support vectors [3], orthogonal ECCs rarely provide much of a speed advantage. What is needed is a constant-time–ideally very fast–non-linear classifier. This is where the piecewise-linear classifier comes in.

For uncertainty coefficient, 1 versus 1 was always better than orthogonal ECCs when using the piecewise-linear classifier. Losses peak at 1.9% relative, 0.017 absolute. For the Brier score, only one of the seven datasets showed an improvement over 1 versus 1, at 4.9%. The worst loss was 39%. Improvements in speed range from 1.1% to over 100%. Much of the speed difference is simply the result of using fewer binary classifiers.

The purpose of the piecewise linear classifier is to improve the speed of the SVM. This speed increase is better with orthogonal ECCs than with 1 versus 1. Orthogonal ECCs applied to piecewise linear classifiers are faster than the fastest SVM for five out of the seven datasets. Speed often trades off against accuracy. [3] provides a procedure for determining whether it is worth switching algorithms. A similar analysis will not be repeated here due to time and space considerations; however, whether any improvement in speed is worth the consequent hit in accuracy will depend on the application.

5 Conclusions

As predicted by the recent literature, solving the multi-class problem using orthogonal ECCs was more accurate than using random ECCs. Unfortunately, orthogonal ECCs were still unable to beat one-versus-one as an effective multi-class method. The author’s own work suggests that 1 versus 1 classification almost always works well regardless of the dataset [35]. Hsu and Lin [8] find that 1 versus 1 outperforms both 1 versus the rest and random ECCs on a test of ten different datasets using SVM. One-versus-one is also used, often exclusively, in many statistical classification software packages.

There may still be room for further work, however, with the most likely fruitful line of inquiry being adaptive methods that use the data to determine how best to go from binary to multi-class. In [35], for instance, even though 1 versus 1 was almost always the most accurate, there was one dataset that benefitted from a more customized treatment. Recent work has focused on both empirically-designed decision trees [5, 6, 39] as well as empirically-designed ECCs [10,11,12]. Decision trees are the easiest to tackle because there are fewer possibilities and because a tree can be built from either the top down or the bottom up.

A second potential area for future work is in multi-class methods integrated with the base binary classifier, for instance with all the binary classifiers being trained simultaneously [8]. It stands to reason that more integrated multi-class methods would tend to be more accurate than those, such as the ones discussed in this note, that treat the binary classifier as a “black box”, since there can now be sharing of information.

There is also a potential use case for orthogonal ECCs. If they are paired with a fast, non-linear binary classifier with better than O(N) performance, where N is the number of training samples, orthogonal ECCs should almost always be faster than 1 versus 1 while giving up little in accuracy. The algorithm presented here that solves for the probabilities is simple and elegant and may suggest new directions in the search for more efficient and accurate multi-class classification algorithms. Since it is fast it could help provide speed improvements for such applications as real-time computer vision, image processing, and voice-recognition.