Introduction

In many real applications of machine learning, one must trade off multiple conflicting objectives. For instance, accuracy and architectural complexity are clearly two different (and possibly conflicting) criteria. This issue can be tackled with a multi-objective decision-making approach.

There are two main approaches to multi-objective decision making: the weighted sum approach, which transforms the original multi-objective problem into a single-objective one by means of a weighted formula; and the Pareto approach, which tackles the original multi-objective problem directly, searching for non-dominated solutions, that is, solutions that are not worse than any other solution with respect to all criteria.

In the weighted sum approach, a multi-objective problem is transformed into a single-objective problem by assigning a numerical weight to each objective and then combining the values of the weighted criteria into a single value. One of the reasons for its popularity is its simplicity. However, it has several drawbacks. First, the definition of the weights is often ad hoc or requires substantial domain knowledge, which may not be available. Second, the optimal solution strongly depends on the chosen weight function, which misses the opportunity to find other models that might actually be more interesting to the user, for instance, by representing a better trade-off between criteria. Third, a weighted formula involving a linear combination of different criteria is meaningless in many scenarios, as the criteria may be non-commensurable (a comparison of apples and oranges).

In the Pareto approach, instead of transforming the multi-objective problem into a single-objective one and then solving it with single-objective decision making, a multi-objective algorithm is used to solve the original problem. The advantage of the Pareto approach is that it can cope with any kind of non-commensurable criteria, and recently there has been increasing interest in learning methods able to cope simultaneously with multi-objective criteria using Pareto optimality [1,2,3,4]. The disadvantage is a loss of power in situations where a good weight function can be devised, since the Pareto approach is more conservative than the weighted sum approach. In this work, we assume that a good weight function is not available. Consider, for instance, the work in [3], which proposes a multi-objective Pareto-based optimization method for the simultaneous optimization of architectural complexity and accuracy of Polynomial Neural Networks (PNN). Using multiple data sets, the authors compare their method with the state-of-the-art method for learning PNN, producing the results presented in Table 1.

Table 1 Architectural complexity and accuracy of two learning methods for PNN [3]. Higher values are better

Based on Table 1, the authors of [3] claim that a multi-objective approach (jointly optimizing architectural complexity and accuracy) is clearly beneficial. Can we say that their method is clearly better than the state of the art for both criteria jointly, and also for each of them independently? For which criterion is it superior (respectively, inferior)? To answer these questions, we need a method that statistically assesses whether an algorithm is better than another in terms of all criteria. To the best of our knowledge, such a method has been lacking in machine learning, and so it could not be used in [3].

Competing methods/algorithms are typically compared by means of a statistical test whose aim is to assess whether one algorithm is significantly better than another (statistically comparing their performance on different data sets or problem instances). For comparing two algorithms over a collection of data sets, the most common approaches are the sign test and the Wilcoxon signed-rank test [5]; however, these tests can only cope with one performance measure (criterion) at a time, that is, they cannot take a multi-objective view without resorting to the weighted sum approach described earlier. In this paper, we develop two tests that cope jointly with multiple performance measures without having to somehow combine them: a frequentist procedure based on the generalized likelihood ratio test and a Bayesian procedure based on a multinomial-Dirichlet conjugate model. We further extend them by discovering conditional independences among measures to reduce the number of parameters of the models, an important add-on, since the number of data sets on which methods are compared is usually limited. Applications of these new tests are numerous. Here, we use data from a comparison of general-purpose classification methods to show a clear practical application of the tests.

This work is an extension of the paper presented at AMBN 2015 [6]. We generalize those ideas to deal with the comparison of multiple algorithms at once, that is, we design new statistical tests that can compare multiple algorithms under multiple performance measures. The idea is to map each ordering of the algorithms to a category in our parameter space, so we can identify the most probable ordering and whether it is significantly more probable than the others.

The paper is organized as follows. Section 2 defines the dominance statement that we later use to design the statistical tests. Sections 3 and 4 present, respectively, our frequentist and Bayesian statistical tests for dealing with multiple measures jointly. Section 5 shows how to use Bayesian networks to improve the estimation of the joint distributions and thus the quality of the results of the tests. Section 6 describes how the designed tests can be easily adapted to deal with multiple algorithms and multiple measures jointly. Section 7 presents our experiments with synthetic data and with data from classification problems, and finally Sect. 8 concludes the work and points out directions for future research.

Joint Analysis of Performance Criteria

Let \(M_1,\dots ,M_m\) be a set of m performance measures (criteria) and assume that we are going to compare two algorithms A and B by jointly using these measures.

Definition 1

We call a ‘dominance statement’ for B against A a sequence of m dominance conditions:

$$\begin{aligned} D^{(BA)}=[\succ ,\succ ,\prec ,\dots ,\succ ], \end{aligned}$$

where the comparison \(\succ \) (or \(\prec \)) in the ith entry of the vector \(D^{(BA)}\) means that algorithm B is better than A (respectively, A is better than B) on measure \(M_i\).□

Our goal is to make inferences on dominance statements by evaluating the m performance measures for the algorithms A and B on n different case studies (for instance, data sets, problem instances, etc.). In other words, we want to decide which \(D^{(BA)}\) is the most appropriate for A and B given tables with values \(M^{(\mathrm {Alg})}_{ij}\) representing the jth measure for the algorithm \(\mathrm {Alg}\in \{A,B\}\) in the ith case study:

$$\begin{aligned} {\mathbf {M}}^{(\mathrm {Alg})}=\left[ \begin{array}{cccc} M^{(\mathrm {Alg})}_{11} & M^{(\mathrm {Alg})}_{12} & \dots & M^{(\mathrm {Alg})}_{1m}\\ M^{(\mathrm {Alg})}_{21} & M^{(\mathrm {Alg})}_{22} & \dots & M^{(\mathrm {Alg})}_{2m}\\ \vdots & \vdots & \vdots & \vdots \\ M^{(\mathrm {Alg})}_{n1} & M^{(\mathrm {Alg})}_{n2} & \dots & M^{(\mathrm {Alg})}_{nm}\\ \end{array}\right] . \end{aligned}$$
(1)

Given the matrix of performances \({\mathbf {M}}^{(A)}\) and \({\mathbf {M}}^{(B)}\), we first build the binary matrix \({\mathbf {X}}=[{\mathbf {M}}^{(B)}\succ {\mathbf {M}}^{(A)}]\), whose entry \(x_{ij}\) is equal to one if algorithm B is better than algorithm A for the jth measure in the ith case study and zero otherwise. We assume that ties do not exist. To each matrix \({\mathbf {X}}\), we associate a count vector \({\mathbf {n}}\), whose entries represent the counts for each one of the \(2^m\) possible dominance statements (many of which might be zero).

Example 1

Consider the comparison of two algorithms in terms of accuracy \(M_1\) (expressed in percent values in the first row; higher is better) and time \(M_2\) (in seconds, shown in the second row; lower is better) on 12 data sets:

$$\begin{aligned} {\mathbf {M}}^A= & \left[ \begin{array}{cccccccccccc} 85 & 87 & 87 & 91 & 91 & 91 & 94 & 94 &94 &94 &94 &94\\ 8 & 11 & 11 & 12 & 12 &12 & 16 & 16 &16 &16 &16 &16 \\ \end{array}\right] ^T,\nonumber \\ {\mathbf {M}}^B= & \left[ \begin{array}{cccccccccccc} 84 & 86 & 86 & 92 & 92 & 92 & 95 & 95 &95 &95 &95 &95\\ 9 & 10 & 10 & 13 & 13 &13 & 15 & 15 &15 &15 &15 &15 \\ \end{array}\right] ^T\, \end{aligned}$$
(2)

where \(^T\) denotes transpose.

The matrix \({\mathbf {X}}=[{\mathbf {M}}^{(B)}\succ {\mathbf {M}}^{(A)}]\) is:

$$\begin{aligned} {\mathbf {X}}=\left[ \begin{array}{cccccccccccc} 0 & 0 & 0 & 1 & 1 &1 & 1 &1 &1 &1 &1 &1 \\ 0 & 1 & 1 & 0 & 0 &0 & 1 &1 &1 &1 &1 &1\\ \end{array}\right] ^T. \end{aligned}$$
(3)

Hence, we derive that the dominance statement \([\prec ,\prec ]\) (or [0, 0]), which means that B is worse than A on both measures, is observed \(n_0=1\) time; the statement \([\prec ,\succ ]\) (or [0, 1]), which means that B is worse than A on the first measure but better on the second, is observed \(n_1=2\) times; the statement \([\succ ,\prec ]\) (or [1, 0]) is observed \(n_2=3\) times; the statement \([\succ ,\succ ]\) (or [1, 1]) is observed \(n_3=6\) times. Hence, we have that \({\mathbf {n}}=[1,2,3,6]\) (a binary lexicographic order is used for the entries of \({\mathbf {n}}\)).□
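The construction of \({\mathbf {X}}\) and \({\mathbf {n}}\) is easy to automate. The following minimal sketch (Python/NumPy; our own illustration, not code from the paper) reproduces Example 1; the `higher_is_better` flag is a device we introduce to handle measures such as time, for which smaller values win.

```python
import numpy as np

M_A = np.array([[85, 87, 87, 91, 91, 91, 94, 94, 94, 94, 94, 94],
                [ 8, 11, 11, 12, 12, 12, 16, 16, 16, 16, 16, 16]]).T
M_B = np.array([[84, 86, 86, 92, 92, 92, 95, 95, 95, 95, 95, 95],
                [ 9, 10, 10, 13, 13, 13, 15, 15, 15, 15, 15, 15]]).T
higher_is_better = np.array([True, False])  # accuracy vs. time

# x_ij = 1 iff B beats A on measure j in case study i (ties assumed absent)
X = np.where(higher_is_better, M_B > M_A, M_B < M_A).astype(int)

# Map each row of X to its binary lexicographic index and count occurrences
m = X.shape[1]
idx = X @ (2 ** np.arange(m - 1, -1, -1))   # e.g. [0,1] -> 1, [1,0] -> 2
n_counts = np.bincount(idx, minlength=2 ** m)
print(n_counts)   # [1 2 3 6], matching Example 1
```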

The matrix \({\mathbf {X}}\) or, equivalently, the vector \({\mathbf {n}}\), contains all the information that we will use to derive our tests. This approach might seem to lose information, because we only account for the sign of each difference \(M^{(\mathrm {Alg})}_{ij}-M^{(\mathrm {Alg}')}_{ij}\). However, if the measures are expressed in incomparable units, there is no effective way of using the actual value of the difference across multiple measures: no procedure could compare the measures jointly or collapse them into a single one (using some weighting function) to run standard tests, and we assume that normalizing the measures is not an option either, as it entails an additional assumption about the measures which might not hold. The sign of the difference, on the other hand, is a properly comparable value among measures, regardless of the particular meaning of each of them. In fact, we point out that the measures \(M^{(\mathrm {Alg})}_{ij}\) can themselves be obtained from any arbitrary procedure (including statistical tests), as we only assume that the sign of the difference \(M^{(\mathrm {Alg})}_{ij}-M^{(\mathrm {Alg}')}_{ij}\) is available (and we properly account for ties). This provides a very general setting, allowing for numerous applications.

Generalized Likelihood Ratio Test

We derive a simple null hypothesis significance test for the joint analysis of performance measures. We denote by \(\theta _k\), for \(k=0,\dots ,2^m-1\), the probability of obtaining the kth of the \(2^m\) possible dominance statements. Hence, \(\theta _k\ge 0\) and \(\sum _{k=0}^{2^m-1}\theta _k=1\). We enumerate the dominance statements according to their “binary order”, so that \(\theta _0\) is the probability of the statement \([\prec ,\dots ,\prec ,\prec ]\), \(\theta _1\) is the probability of \([\prec ,\dots ,\prec , \succ ]\), \(\theta _2\) is the probability of \([\prec ,\dots ,\prec ,\succ , \prec ]\), and so on. Our goal is to determine whether some statement is significantly more likely than all others, based on the observation matrix \({\mathbf {X}}\). It is clear that \({\mathbf {n}}\) is a sufficient statistic for this test, since its kth entry \(n_k\) corresponds to the counts for the kth statement. Hence, to achieve our goal, we can perform a Generalized Likelihood Ratio Test (GLRT):

$$\begin{aligned} \lambda ({\mathbf {n}})=\frac{\max _{\varvec{\theta }\in \Theta ^*}L(\varvec{\theta }|{\mathbf {n}})}{\max _{\varvec{\theta }\in \Theta }L(\varvec{\theta }|{\mathbf {n}})},\quad \text { where }\, L(\varvec{\theta }|{\mathbf {n}})=\prod \limits _{k=0}^{2^{m}-1}\theta _k^{n_k}, \end{aligned}$$
(4)

\(\varvec{\theta }=[\theta _0,\dots ,\theta _{2^m-1}]\), \(\Theta \) is the simplex for \(\varvec{\theta }\), \(\Theta ^*=\{\varvec{\theta }\in \Theta :~\theta _{i^*}\le \max ({\varvec{\theta }}{\setminus }\theta _{i^*})\}\) (we abuse notation and indicate by \({\varvec{\theta }}{\setminus }\theta _{i^*}\) all thetas apart from \(\theta _{i^*}\)) and \(i^*={{\mathrm{argmax}}}_{i=0,\dots ,2^m-1} n_i\). The rationale behind Eq. (4) is that we are testing two hypotheses: (\(H_0\)) \(\theta _{i^*}\le \max ({\varvec{\theta }}{\setminus }\theta _{i^*})\) and (\(H_1\)) \(\theta _{i^*}>\max ({\varvec{\theta }}{\setminus }\theta _{i^*})\). Under \(H_0\), the value of \(\varvec{\theta }\) that best explains the observations is the maximum likelihood estimate (MLE) subject to the constraint \(\varvec{\theta }\in \Theta ^*\); its likelihood is the numerator of Eq. (4). The value of \(\varvec{\theta }\) that maximizes the likelihood overall is instead the MLE subject to \(\varvec{\theta }\in \Theta \). It is clear that \(0 \le \lambda ({\mathbf {n}}) \le 1\). The GLRT employs \(\lambda ({\mathbf {n}})\) as a test statistic and rejects \(H_0\) for small values of \(\lambda ({\mathbf {n}})\), that is, when \(\lambda ({\mathbf {n}}) \le \rho \), where the value of \(\rho \) is determined by fixing the type I error to be \(\alpha \). By Wilks’ theorem, for large n, \(-2\log ( \lambda ({\mathbf {n}}))\) is chi-squared distributed with one degree of freedom [7, 8]. Hence, the rejection zone for the null hypothesis is approximately equal to

$$\begin{aligned} \mathcal {R}=\left\{ {\mathbf {n}}:~-2\log (\lambda ({\mathbf {n}}))> \chi ^2_{1,\alpha }\right\} , \end{aligned}$$
(5)

where \(\alpha \) is the significance level. Therefore, to apply the GLRT, we only need to compute \(\lambda ({\mathbf {n}})\).

Theorem 1

Given the count vector \({\mathbf {n}}\), it holds that

$$\begin{aligned} \lambda ({\mathbf {n}})=\frac{\left( \frac{n_a+n_b}{2}\right) ^{n_a+n_b}}{n_a^{n_a}n_b^{n_b}}, \end{aligned}$$
(6)

where \(n_a\) is the greatest value among \(n_0,\dots ,n_{2^m-1}\) and \(n_b\) the second greatest.

Proof

The maximum likelihood estimate of \(\varvec{\theta }\) subject to the constraint \(\varvec{\theta } \in \Theta \) is

$$\begin{aligned} \left( \frac{n_0}{n},\frac{n_1}{n},\dots ,\frac{n_{2^m-1}}{n}\right) , \end{aligned}$$

since the only constraint on \(\varvec{\theta }\) in this case is that its elements sum to 1. The maximum likelihood estimate of \(\varvec{\theta }\) subject to the constraint \(\varvec{\theta }\in \Theta ^*=\{\varvec{\theta }\in \Theta :~\theta _{i^*}\le \max ({\varvec{\theta }}{\setminus }\theta _{i^*})\}\) can be computed using the Karush–Kuhn–Tucker (KKT) optimality conditions for optimization problems subject to inequality constraints [9]. To obtain this estimate, let us assume without loss of generality that \(n_0\ge n_1 \ge n_2\ge \cdots \), so that the constraint is \(\theta _0\le \theta _1\). To facilitate the derivation, let us work with the log-likelihood function (it has the same maximum). The KKT conditions are

$$\begin{aligned} \frac{n_0}{\theta _0} = \mu + \nu \, \quad \text { and } \quad \frac{n_1}{\theta _1} = \mu - \nu \, \quad \text { and } \quad \forall k\ge 2:~ \frac{n_k}{\theta _k} = \mu \, \quad \text { and } \quad \nu \cdot (\theta _0-\theta _1)=0. \end{aligned}$$

Multiplying each of the first three conditions by the corresponding \(\theta _k\) and summing, we obtain \(n=\mu + \nu \cdot (\theta _0-\theta _1)=\mu \). Since \(n_0\ge n_1\), the constraint is active at the optimum (otherwise \(\nu =0\) and the unconstrained solution \(\theta _k=n_k/n\) would violate \(\theta _0\le \theta _1\)), so \(\theta _0=\theta _1\), and adding the first two conditions gives \(\theta _0=\theta _1=\frac{n_0+n_1}{2n}\). So we have that the constrained maximum likelihood estimate of \(\varvec{\theta }\) is

$$\begin{aligned} \left( \frac{n_c}{n},\frac{n_c}{n},\frac{n_2}{n},\dots ,\frac{n_{2^m-1}}{n}\right) , \end{aligned}$$

where \(n_c=(n_0+n_1)/2\). Then the likelihood ratio is

$$\begin{aligned} \frac{\left( \tfrac{n_c}{n}\right) ^{n_0}\cdot \left( \tfrac{n_c}{n}\right) ^{n_1}\cdot \left( \tfrac{n_2}{n}\right) ^{n_2}\cdots \left( \tfrac{n_{2^m-1}}{n}\right) ^{n_{2^m-1}}}{\left( \tfrac{n_0}{n}\right) ^{n_0}\cdot \left( \tfrac{n_1}{n}\right) ^{n_1}\cdot \left( \tfrac{n_2}{n}\right) ^{n_2}\cdots \left( \tfrac{n_{2^m-1}}{n}\right) ^{n_{2^m-1}}}= \frac{n_c^{n_0+n_1}}{n_0^{n_0}n_1^{n_1}}, \end{aligned}$$

which proves the theorem.□

In case \(n_a=n_b\), we have \(\lambda ({\mathbf {n}})=1\) and \(-2\log ( \lambda ({\mathbf {n}}))=0\), so that the null hypothesis can never be rejected. It can be shown that:

Theorem 2

The GLRT of Eq. (5) is (asymptotically) calibrated for a prescribed significance level \(\alpha \), attaining the maximum type I error when \(n_a+n_b=n\).□

This can be proven using an approach similar to that described in [10, Ex. 21.2].

Example 2

In Example 1, \(m=2\) and Eq. (2) yields \(L(\varvec{\theta }|{\mathbf {n}})=\theta _0 \theta _1^2 \theta _2^3 \theta _3^6\), where \(\theta _0\) is the probability of the statement \([\prec ,\prec ]\), \(\theta _1\) of \([\prec ,\succ ]\), \(\theta _2\) of \([\succ ,\prec ]\) and \(\theta _3\) of \([\succ ,\succ ]\). Hence, \(n_a=6\), \(n_b=3\), the statistic \( \lambda ({\mathbf {n}})=\tfrac{(\frac{9}{2})^9}{3^36^6}\approx 0.6\), and the p value is 0.313. Given this p value, we cannot conclude that B is better than A on both measures.□
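These numbers are straightforward to reproduce. Below is a short sketch (Python/SciPy; our own illustration) that evaluates the closed form of Theorem 1 and the asymptotic p value from Wilks’ theorem; it assumes the two largest counts are strictly positive.

```python
import numpy as np
from scipy.stats import chi2

def glrt_p_value(n_counts):
    """GLRT statistic lambda(n) and asymptotic p value (Theorem 1 + Wilks)."""
    n_a, n_b = sorted(n_counts)[-2:][::-1]   # greatest and second greatest counts
    if n_a == n_b:
        return 1.0, 1.0                       # lambda = 1, H0 never rejected
    # log lambda = (n_a+n_b) log((n_a+n_b)/2) - n_a log(n_a) - n_b log(n_b)
    # (assumes n_b > 0, otherwise 0*log(0) must be treated as 0)
    log_lam = ((n_a + n_b) * np.log((n_a + n_b) / 2)
               - n_a * np.log(n_a) - n_b * np.log(n_b))
    return np.exp(log_lam), chi2.sf(-2 * log_lam, df=1)

lam, p = glrt_p_value([1, 2, 3, 6])
print(round(lam, 3), round(p, 3))   # 0.601 0.313, as in Example 2
```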

GLRTs have the disadvantage that they do not provide the probability of the hypotheses, but only a p value under \(H_0\). This means that we have no information about the probability of the alternative hypothesis being true. To address this issue, in the next section we propose a Bayesian hypothesis test for testing a dominance statement. On the other hand, the GLRT is extremely fast compared to the Bayesian test.

Bayesian Test

We implement the Bayesian hypothesis test by following a Bayesian estimation approach, that is, by estimating the posterior distribution of the vector of parameters \(\varvec{\theta }\). Given the count vector \({\mathbf {n}}\), the likelihood of \(\varvec{\theta }\) is given by the right-hand side of Eq. (4), which is a multinomial likelihood. As prior, we consider a Dirichlet distribution: \(p(\varvec{\theta })\propto \prod \nolimits _{k=0}^{2^{m}-1}\theta _k^{\alpha _k-1}\), where \(\alpha _k>0\) are the parameters of the Dirichlet distribution. In the rest of the paper, we will always use the symmetric prior \(\alpha _k=1/2^{m}\) (however, one could also use other priors, such as the Jeffreys prior \(\alpha _k=\frac{1}{2}\), or some robust prior model [11]). By conjugacy, the posterior is also Dirichlet, with updated parameters \(n_k+\alpha _k\). In the Bayesian setting, to make inferences on a dominance statement, we simply have to compute the posterior probabilities \(P(\theta _i>\max ({\varvec{\theta }}{\setminus }\theta _{i})|{\mathbf {n}})\), for \(i=0,\dots ,2^{m}-1\). This is the posterior probability that \(\theta _i\) (associated with the ith statement) is greater than all the other \(\theta \) values.

Proposition 1

It holds that \(\sum \limits _{i=0}^{2^{m}-1} P(\theta _i>\max ({\varvec{\theta }}{\setminus }\theta _{i})|{\mathbf {n}})=1\).

This result follows from the fact that \(P(\theta _i=\theta _j|{\mathbf {n}})=0\): the \(\theta _i\) are continuous random variables, and any probability density function on continuous variables assigns probability zero to such singleton events. Hence, the above posterior probabilities account for all the available information on the dominance statements. These probabilities can easily be computed by Monte Carlo sampling, drawing vectors \(\varvec{\theta }\) from the posterior Dirichlet distribution and counting the fraction of times we see \(\theta _i>\max ({\varvec{\theta }}{\setminus }\theta _{i})\), for every i.

Example 3

Take again Example 1. We already know that \(L(\varvec{\theta }|{\mathbf {n}})=\theta _0 \theta _1^2 \theta _2^3 \theta _3^6\), where \(\theta _0\) is the probability of the statement \([\prec ,\prec ]\), \(\theta _1\) of \([\prec ,\succ ]\), \(\theta _2\) of \([\succ ,\prec ]\) and \(\theta _3\) of \([\succ ,\succ ]\). The posterior probabilities of hypotheses are: \(P(\theta _0>\theta _{\lnot 0}|{\mathbf {n}})\approx 0.013\), \(P(\theta _1>\theta _{\lnot 1}|{\mathbf {n}}) \approx 0.051\), \(P(\theta _2>\theta _{\lnot 2}|{\mathbf {n}}) \approx 0.136\), and \(P(\theta _3>\theta _{\lnot 3}|{\mathbf {n}}) \approx 0.80\). Hence, the most probable dominance statement is \([\succ ,\succ ]\) and its probability is 0.8. These probabilities have been computed by Monte Carlo sampling as discussed above.□
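A minimal sketch of this Monte Carlo computation (Python/NumPy; our own illustration) is shown below; the estimates fluctuate slightly with the number of posterior samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_counts = np.array([1, 2, 3, 6])
alpha = 1 / len(n_counts)              # symmetric prior alpha_k = 1/2^m
posterior = n_counts + alpha           # Dirichlet(1.25, 2.25, 3.25, 6.25)

samples = rng.dirichlet(posterior, size=100_000)
probs = np.bincount(samples.argmax(axis=1), minlength=len(n_counts)) / len(samples)
print(probs.round(3))   # roughly [0.013 0.051 0.136 0.800], as in Example 3
```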

Bayesian Network

The columns of \({\mathbf {X}}=[{\mathbf {M}}^{(B)}\succ {\mathbf {M}}^{(A)}]\) can be seen as binary random variables \(\mathcal {M}=\{M_1,\ldots ,M_m\}\) representing which algorithm is better according to each measure. Because of possible stochastic conditional independences between these variables, the estimation of the joint probability \(p({\mathcal M})\) can be improved using a Bayesian network (BN). A BN can be defined as a triple \(({\mathcal G},{\mathcal M},{\mathcal P})\), where \({\mathcal G}=(V_{\mathcal G},E_{\mathcal G})\) is a directed acyclic graph (DAG) with \(V_{\mathcal G}\) a collection of m nodes associated with the random variables \({\mathcal M}\) (one node per variable) and \(E_{\mathcal G}\) a collection of arcs; \({\mathcal P}\) is a collection of conditional probabilities \(p(M_i|\textit{PA}_i)\), where \(\textit{PA}_i\) denotes the parents of \(M_i\) in the graph (\(\textit{PA}_i\) may be empty), corresponding to the relations of \(E_{\mathcal G}\). In a Bayesian network, the Markov condition states that every variable is conditionally independent of its non-descendants given its parents. This structure induces a joint probability distribution through the factorization \(p(M_1,\dots ,M_m) = \prod _i p(M_i|\textit{PA}_i)\). Let \({{\varvec{\theta }}}\) be the entire vector of parameters such that \(\theta _{ijk}=p(M_i=k|\textit{PA}_i={j})\), where \(k\in \{0,1\}\), \(j\in \{1,...,2^{|\textit{PA}_i|}\}\) and \(i\in \{1,\ldots ,m\}\). Note that this is a different parametrization from the \({\varvec{\theta }}\) of the previous sections, but a simple transformation through the factorization expression recovers those values. Given the table \(\mathbf{X}\) with m measures and n case studies, the structure learning problem in Bayesian networks is to find a DAG \({\mathcal G}\) that maximizes its posterior probability, that is, \({\mathcal G}^*={{\mathrm{argmax}}}_{{\mathcal G}\in {\varvec{{\mathcal G}}}} {p({\mathcal G}|\mathbf{X})}\), with \({\varvec{{\mathcal G}}}\) the set of all DAGs over the node set \({\mathcal M}\). The graph posterior is

$$\begin{aligned} p({\mathcal G}|\mathbf{X}) \propto p({\mathcal G})\cdot \int p(\mathbf{X}|{\mathcal G},{\varvec{\theta }})\cdot p({\varvec{\theta }}|{\mathcal G}) \mathrm{d}{\varvec{\theta }}, \end{aligned}$$

where \(p({\varvec{\theta }}|{\mathcal G})\) is the prior of \({\varvec{\theta }}\) for a given graph \({\mathcal G}\), assumed to be a symmetric Dirichlet with positive hyper-parameter \(\alpha ^*\):

$$\begin{aligned} p({\varvec{\theta }}|{\mathcal G}) =\prod _{i=1}^m\prod _{j=1}^{2^{|\textit{PA}_i|}}\Gamma \left( \frac{\alpha ^*}{2^{|\textit{PA}_i|}}\right) \prod _{k=0}^{1}\frac{\theta _{ijk}^{ \frac{\alpha ^*}{2^{|\textit{PA}_i|+1}}-1}}{\Gamma \left( \frac{\alpha ^*}{2^{| \textit{PA}_i|+1}}\right) }. \end{aligned}$$
(7)

\(\alpha ^*\) is usually referred to as the Equivalent Sample Size (ESS). This computation is known as the Bayesian–Dirichlet Equivalent Uniform (BDeu) criterion [12, 13], where we assume parameter independence and modularity [14]. We further set \(\alpha ^*=1\) and, assuming no preference for any graph, take \(p({\mathcal G})\) to be uniform.

To find the graph representing the best set of conditional independences over the space of all possible DAGs \({\varvec{{\mathcal G}}}\), multiple approaches have been proposed in the literature. Because the number of measures is rarely above 15–20 and they are all binary, combining properties of the BDeu score [15] with a dynamic programming algorithm [16] usually suffices. Otherwise, one can resort to more sophisticated ideas [17,18,19], which can deal with a greater number of variables. Given the optimal graph \({\mathcal G}\), we can exploit the discovered conditional independences to write the likelihood of the data compactly:

$$\begin{aligned} p(\mathbf{X}|{\mathcal G},{{\varvec{\theta }}}) =\prod _{i=1}^m\prod _{j=1}^{2^{|\textit{PA}_i|}} \theta _{ij0}^{n_{ij0}}(1-\theta _{ij0})^{n_{ij1}}, \end{aligned}$$
(8)

where \(n_{ijk}\) counts the number of times \((M_i=k\wedge \textit{PA}_i=j)\) occurs in the data. Combined with the prior \(p({\varvec{\theta }}|{\mathcal G})\) of Eq. (7), this can be used to compute \(P(\theta _i>\max ({\varvec{\theta }}{\setminus }\theta _{i})|{\mathbf {X}})\) by Monte Carlo sampling as before (although this parametrization of \({\varvec{\theta }}\) differs from that of the previous sections, it serves the same purpose). The advantages of using Bayesian networks are as follows. First, through \(p({\mathcal G}|\mathbf{X})\), the dependence model underlying the distribution is automatically adapted to what can be inferred from the data, so one usually needs fewer observations to learn a good model than when working with the full joint. Second, the graph can be used to identify relations between measures and how they are associated, which can, for instance, be used to ignore measures that do not help discriminate the algorithms. Third, computations can be carried out efficiently (at least when we restrict ourselves to a couple of tens of variables, that is, performance measures). We will illustrate these benefits later on.
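To make the scoring machinery concrete, the sketch below (Python/SciPy; our own code, with a hypothetical function name and interface) computes the log BDeu local score of one binary variable of \(\mathbf{X}\) given a candidate parent set; a structure learner would sum such local scores over all nodes and search for the maximizing DAG.

```python
import numpy as np
from scipy.special import gammaln

def bdeu_local_score(X, i, parents, alpha_star=1.0):
    """Log BDeu score of binary column i of X given a tuple of parent columns."""
    q = 2 ** len(parents)                     # number of parent configurations
    a_j, a_jk = alpha_star / q, alpha_star / (2 * q)
    score = 0.0
    for j in range(q):
        # select the rows matching the j-th parent configuration
        bits = [(j >> b) & 1 for b in range(len(parents))]
        mask = np.ones(len(X), dtype=bool)
        for p, bit in zip(parents, bits):
            mask &= (X[:, p] == bit)
        n_j1 = int(X[mask, i].sum())          # counts of M_i = 1 (and = 0 below)
        n_j = int(mask.sum())
        score += gammaln(a_j) - gammaln(a_j + n_j)
        score += (gammaln(a_jk + n_j1) - gammaln(a_jk)
                  + gammaln(a_jk + n_j - n_j1) - gammaln(a_jk))
    return score

# e.g., with the X of Example 1: is M2 better modeled with M1 as a parent?
X = np.array([[0, 0], [0, 1], [0, 1], [1, 0], [1, 0], [1, 0],
              [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1]])
print(bdeu_local_score(X, 1, ()), bdeu_local_score(X, 1, (0,)))
```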

Comparing Multiple Algorithms

The results so far dealt with the comparison of two algorithms under multiple performance measures. In this section, we generalize this approach to compare multiple algorithms. To do so, we must define an extended dominance statement, where each element corresponds to an ordering of the algorithms under comparison. Let \(M_1,\dots ,M_m\) be a set of m performance measures (criteria) and assume that we are going to compare l algorithms \(A_1,\ldots ,A_l\) by jointly using these measures.

Definition 2

We call a ‘dominance statement’ for algorithms \(A_1,\ldots ,A_l\) a sequence of m dominance conditions:

$$\begin{aligned} D^{(A_1,\ldots ,A_l)}=[o_1,o_2,o_3,\dots ,o_m], \end{aligned}$$

where the value \(o_i\) in the ith entry of the vector \(D^{(A_1,\ldots ,A_l)}\) is an integer from 0 to \(l!-1\) indicating a permutation of the algorithms \(A_1,\ldots ,A_l\) which respects their performances according to measure \(M_i\). For simplicity, we sort the l! possible permutations in lexicographical order and assign an integer accordingly.□

In short, the element \(o_i\) of the vector \(D^{(A_1,\ldots ,A_l)}\) tells us the order of goodness (best to worst) among the algorithms being compared for that measure. Ties are dealt with as explained before. Using this definition, the matrix \({\mathbf {X}}\) defined earlier can now be built directly with the values from \(D^{(A_1,\ldots ,A_l)}\), which become the rows of \(\mathbf {X}\) (one row per data set). Everything proceeds as before, but the state space for the generalized likelihood ratio test and for the Bayesian test is no longer of size \(2^m\) but of size \((l!)^m\). Hence, \(\theta _k\ge 0\) and \(\sum _{k=0}^{(l!)^m-1}\theta _k=1\), and there is an obvious correspondence between each \(\theta _k\) and the ordering it represents (which can be inferred from the value k). All definitions and derivations in Sects. 3, 4 and 5 can be easily adapted by employing this extended parameter space. For instance, Expressions (7) and (8) need to account for variables of the Bayesian network taking on l! values instead of two. Moreover, an interesting property of this extension to multiple algorithms is that the marginal models (that is, projecting the joint distribution for multiple algorithms onto any two particular algorithms) yield exactly the same results as the simplified formulation devised for only two algorithms, so this idea generalizes the earlier version in a sound manner. For the sake of simplicity, we do not restate the results of Sects. 3, 4 and 5, since they do not particularly depend on the number of categories in the state space and hence carry over directly to this setting with multiple algorithms.
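To make Definition 2 concrete, the sketch below (Python; our own illustration, with hypothetical names) maps the performances of the l algorithms on one measure and one data set to the lexicographic index of the corresponding best-to-worst permutation; stacking these indices over measures yields one row of \({\mathbf {X}}\).

```python
from itertools import permutations

def order_index(scores, higher_is_better=True):
    """Index (0 to l!-1) of the best-to-worst permutation, lexicographic order."""
    l = len(scores)
    ranking = tuple(sorted(range(l), key=lambda a: scores[a],
                           reverse=higher_is_better))
    return list(permutations(range(l))).index(ranking)

# e.g., three algorithms with accuracies 0.91, 0.95, 0.88 on one data set:
# best-to-worst order is (1, 0, 2), the third permutation lexicographically
print(order_index([0.91, 0.95, 0.88]))   # 2
```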

Fig. 1 ROC curves for the GLRT (gray dash-dotted) and the Bayesian test with (black dashed) and without (black solid) the Bayesian network during learning. Distributions and data (n samples) are generated for a domain with m measures

Experiments

We perform a simulated study to understand the benefit of using Bayesian networks. We study scenarios with m equal to 2 and 3 measures and with 3 algorithms (that is, \(l=3\)), for which we draw the multinomial parameters uniformly at random, that is, \((3!)^2-1=35\) and \((3!)^3-1=215\) independent parameters, respectively. This reflects a scenario of full dependence between measures (with probability 1). We label each test case (that is, each draw) as follows: if the maximum \(\theta \) is greater than the second greatest plus 0.1%, then the case is labeled as one where the maximum differs from the others; otherwise, we say the maximum is not greater than the others (and we force the maximum and the second greatest to be equal to each other). Then we randomly generate n samples (\(n=\)50, 100 or 200) from the distribution and run the GLRT and the Bayesian test, with and without the support of the Bayesian network for learning the underlying distribution from data. For each test case, we record the probability that the maximum parameter is greater than the others (or the p value in the case of the GLRT). This procedure is repeated one thousand times for cases where the maximum is greater (positive cases) and one thousand times with the maximum equal to the second greatest value (negative cases). The results over these two thousand test cases are used to build a receiver operating characteristic (ROC) curve in the usual way: true/false positives/negatives are defined by varying the threshold on the probability (respectively, the p value) at which the method declares a positive or a negative case. In this way, we obtain the percentage (over the two thousand test cases) of true positives, true negatives, false positives and false negatives for each method at each threshold. The curves for the GLRT (gray dash-dotted) and the Bayesian test with the Bayesian network (black dashed) and without it (black solid) are shown in Fig. 1 for different values of m and n. In general, the GLRT is equal or inferior to the Bayesian test, and the Bayesian test with the Bayesian network is usually superior to the Bayesian test alone. We notice that in some cases the curves are barely better than random guessing (which corresponds to the diagonal of the ROC graph). This happens because there are too many parameters to learn in the multinomial relative to the amount of data. With more data (\(n=200\)), and when the measures are independent (the scenario considered next), the curves improve over random guessing, because the true models are simpler and hence need fewer data to be learned.
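For concreteness, a condensed sketch of this simulation protocol is given below (Python/NumPy; our own simplification, with far fewer repetitions, a fixed number of posterior samples, and only the plain Bayesian test without the Bayesian network); it estimates the area under the ROC curve directly from the positive and negative scores.

```python
import numpy as np

rng = np.random.default_rng(1)

def bayes_test_score(counts, alpha):
    """Posterior probability that the most frequent category has the largest theta."""
    samples = rng.dirichlet(counts + alpha, size=5_000)
    return (samples.argmax(axis=1) == counts.argmax()).mean()

def one_case(K, n, positive):
    theta = rng.dirichlet(np.ones(K))           # uniform draw on the simplex
    theta[::-1].sort()                          # sort descending in place
    if positive:
        theta[0] = max(theta[0], theta[1] * 1.001)  # max exceeds 2nd by >0.1%
    else:
        theta[0] = theta[1]                     # top two forced equal
    theta /= theta.sum()
    counts = rng.multinomial(n, theta)
    return bayes_test_score(counts, 1 / K)

K, n, reps = 36, 100, 200                       # (l!)^m = 36 for l=3, m=2
pos = np.array([one_case(K, n, True) for _ in range(reps)])
neg = np.array([one_case(K, n, False) for _ in range(reps)])

# AUC = P(positive score > negative score) + 0.5 P(tie), over all pairs
greater = (pos[:, None] > neg[None, :]).mean()
ties = (pos[:, None] == neg[None, :]).mean()
print(greater + 0.5 * ties)
```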

Fig. 2 ROC curves for the GLRT (gray dash-dotted) and the Bayesian test with (black dashed) and without (black solid) the Bayesian network during learning. Distributions and data (n samples) are generated uniformly at random for a domain with m measures, under the assumption that all measures are independent of each other

We repeat the experiment, but now assuming that the measures are independent of each other. In this scenario, we expect the method with the Bayesian network to be superior to the one using the full joint distribution, as it can estimate a more appropriate distribution (given the limited amount of data). The idea is that the Bayesian network can learn that the measures are independent (this fact is not disclosed to the methods, as in practice we usually would not know it beforehand). Again, we draw the parameters of the multinomial uniformly at random (respecting the independence assumption among all measures), then we draw the data and label the cases as before. The ROC curves for this scenario are shown in Fig. 2. Again, the Bayesian test with the Bayesian network achieves the best curves.

Table 2 shows the area under the curve for each method and scenario when comparing the three simulated algorithms. The values obtained by the GLRT are inferior to those of the Bayesian test, and the latter consistently produces better results with the support of the Bayesian network for learning the distribution. The superiority of the method with the Bayesian network is explained by the better estimation of the joint distribution through its underlying independence assessments. We notice in Table 2 that the Bayesian network version loses to one of the others only when the data were generated from the full joint distribution and the amount of data is not so small; in those cases, the data suffice to fit the joint distribution well directly, without the need for the Bayesian network.

Table 2 Area under the ROC curve for each method in each scenario when comparing 3 algorithms

We repeat the experiment with only two algorithms and record the area under the ROC curve as before. Results are presented in Table 3. With only two algorithms, the results are even more clearly in favor of the Bayesian test with the support of the Bayesian network.

Table 3 Area under the ROC curve for each method in each scenario when comparing 2 algorithms

Data of Classification Problems

In this section, we apply our tests to compare classifiers on 80 data sets (10 runs of tenfold cross-validation) using several performance measures. We start by comparing classifiers two by two, and later we demonstrate the case of more than two classifiers compared at once. We consider the following classifiers: ‘AODE’ (\(C_1\)), ‘Bayes net’ (\(C_2\)), ‘Bayes.NaiveBayes’ (\(C_3\)), ‘trees.J48graft’ (\(C_4\)), ‘trees.RandomForest’ (\(C_5\)) and ‘logistic’ (\(C_6\)). We performed all the experiments using WEKA [20], which implements all these classifiers, and analyzed the results using simple scripts in R. We note that our purpose is not to conclude in favor of or against any of the classifiers, but to illustrate the use of our new approaches to comparing them. The data and measures used in the analysis are available at http://www.cs.qub.ac.uk/~c.decampos/ngc2016/. As an illustration, we compare the classifiers using accuracy, F-measure and weighted-AUC: (1) separately; (2) considering pairwise combinations of these measures; (3) considering the three measures together.

For the case of accuracy and weighted-AUC, Matrix (9) (on the left) reports the results of the comparison obtained by considering each of these measures separately (each cell contains the result for accuracy on top and weighted-AUC below it), while Matrix (9) (on the right) is the result of the Bayesian joint test. For the separate tests, we have used the Wilcoxon signed-rank test [5]. The numerical values in Matrix (9) (on the left) are the p values of the Wilcoxon signed-rank test computed in the direction (\(\prec \) or \(\succ \)) corresponding to the highest value of the statistic (the most likely direction to refute the null hypothesis). For instance, the meaning of the comparison \(C_1\) versus \(C_5\) is as follows: \(C_1\) has been found worse than \(C_5\) in accuracy (with p value 0.17) and better in weighted-AUC (with p value 0.14). All pairwise comparisons with p values less than \(\alpha /2\) (e.g., \(\alpha =0.1\) or 0.05) are significant. Matrix (9) (on the right) reports the comparison performed with the Bayesian test considering accuracy and weighted-AUC jointly. In this case, each entry of the matrix represents the most probable joint dominance statement, and the numerical value is the relative probability. Comparing the two matrices, there are two cases where the tests are in clear contradiction (in bold) and a case (\(C_4\) vs. \(C_6\)) where the joint comparison gives an evident advantage in power. This means that \(C_4\) is better than \(C_6\) jointly on both accuracy and weighted-AUC, while this is not true when the two performance measures are considered separately. Therefore, it is evident that decisions derived by a joint test can be very different from those obtained by running a separate test for each performance measure. If the goal is to compare algorithms considering the measures jointly, then it is more appropriate to use the new methods proposed here. The GLRT is overall consistent with the results obtained by the Bayesian test (results not shown). For instance, its p value for “\(C_4\) better than \(C_6\) on both performance measures” is almost zero (thus highly significant). The choice between the GLRT and the Bayesian test depends on the user’s needs.

(9)
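For the separate analyses, the direction and p value in each cell can be obtained as in the following sketch (Python/SciPy; our own illustration with made-up numbers, not the paper’s data): run the one-sided Wilcoxon signed-rank test in both directions and keep the more favorable one.

```python
import numpy as np
from scipy.stats import wilcoxon

acc_C1 = np.array([0.81, 0.92, 0.77, 0.88, 0.85, 0.90])   # hypothetical accuracies
acc_C5 = np.array([0.83, 0.95, 0.80, 0.86, 0.89, 0.93])

_, p_greater = wilcoxon(acc_C1, acc_C5, alternative='greater')  # C1 > C5?
_, p_less = wilcoxon(acc_C1, acc_C5, alternative='less')        # C1 < C5?
direction, p = ('>', p_greater) if p_greater < p_less else ('<', p_less)
print(f"C1 {direction} C5 on accuracy, p = {p:.3f}")
```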

Now we consider weighted-AUC and F-measure together. Matrix (10) (on the left) reports the results of the comparison based on separate tests (each cell contains the result for weighted-AUC on top and F-measure below it), while Matrix (10) (on the right) reports the Bayesian joint test. There are five cases where the tests are in contradiction (in bold). In particular, in the comparisons \(C_2\) vs. \(C_5\) and \(C_3\) vs. \(C_5\), the Bayesian test asserts that \(C_5\) is jointly better with probability 0.91, while the separate tests do not find a significant dominance. Again, for \(C_4\) vs. \(C_6\), it is evident that the joint comparison gives an advantage in power.

(10)

Finally, we consider the three performance measures together. Matrix (11) reports the result of the Bayesian joint test.

(11)

We can then assert that \(C_1\) is better than \(C_2\) and \(C_3\) jointly on all performance measures. Overall, \(C_5\) appears to be jointly the best classifier, followed by \(C_4\). Using the Bayesian network inference to compare \(C_4\) and \(C_5\), we reach the very same conclusions (results not shown). The interesting outcome of that inference is that we can graphically see the relation between measures in Fig. 3, which is automatically learned from the matrix of measures; not surprisingly, all three measures of classification performance are dependent.

Fig. 3 The three measures used to compare \(C_4\) and \(C_5\) and their (in)dependences

Finally, we have run the joint tests using all three measures and the classifiers \(C_1,C_2,C_3\) at once, as described in Sect. 6. The GLRT and the Bayesian test, with and without the Bayesian network, all concluded the following orderings: for accuracy and F-measure, \(C_1\succ C_2\succ C_3\), while for weighted-AUC, \(C_1\succ C_3\succ C_2\). These orderings are the same as those found in the pairwise analysis, as can be seen in the first entries of Matrix (11). This result confirms the previous analysis and makes it much stronger, since it states that these are the most probable orderings among the classifiers when measures and classifiers are analyzed jointly. Because \(C_2\) is better than \(C_3\) on two measures but worse on the other, in a Pareto sense we could say that \(C_1\) is the best classifier, while \(C_2\) and \(C_3\) do not dominate each other. This kind of situation demonstrates the importance of performing the correct analysis for the problem at hand. Such a decision is nevertheless dependent on the objectives of the study and should be taken case by case, since no single methodology fits all problems.

Conclusions

Comparing algorithms under multiple measures is typically performed using independent statistical tests. In this paper, we have developed new statistical tests that can compare multiple algorithms at once while considering all performance measures jointly. This allows us, for example, to state that a classifier is jointly better than another on multiple measures, as well as on particular subsets of measures. These subsets can be identified with the use of a Bayesian network, that is, by modelling the (in)dependences among measures. With synthetic and real data, we have shown that the decisions derived by a joint test can be very different from those obtained by running a separate test for each performance measure. We argue that the ideas developed here offer a new way of comparing algorithms under multiple performance measures. A clear drawback is that we cannot compare too many algorithms at the same time, because the size of the parameter space grows exponentially with the number of measures and factorially with the number of algorithms being compared. As future work, we intend to overcome this issue by learning an appropriate space of orderings instead of considering all possible orderings among the algorithms. We also intend to explore applications and the further use of the Bayesian network structure to understand the relations between performance measures and their importance for the evaluation of algorithms.