1 Introduction

The need to quantitatively model order structures and to make inferences about them arises in many fields: rank manipulation in statistics (Marden 1995), preference modelling in multi-criteria decision making (Grabisch and Labreuche 2008), preference learning (Fürnkranz and Hüllermeier 2010), decision theory, etc.

Orders being complex structures, information about them is often incomplete or uncertain. However, while the need to consider partial observations of complete orders was quickly acknowledged in the previously mentioned fields, quantitative methods still focus mainly on representing and inferring complete orderings (this is in contrast with more qualitative representations, such as CP-nets, Boutilier et al. 2004, that are tailored to model partial orders).

It is only recently that the task of inferring partial orders has gained some interest in fields such as learning to rank (Cheng et al. 2010, 2012) or learning multi-criteria aggregation functions (Labreuche 2010) (note that this task of inferring partial orders is quite different from trying to choose a unique representation from partial information, Greco et al. 2011). Even in these cases, partial orders are seen as incomplete but reliable inferences about an ideal underlying complete order, about which we have insufficient information.

This paper takes a quite different perspective, as partial orders are here considered as the primary objects, meaning that linear orders are merely a special case of the presented framework. More precisely, this work proposes to use belief functions bearing on the relation between pairs of objects and to infer information about partial orders from them. It provides generic as well as pragmatic tools to perform inferences. Belief functions indeed provide an interesting uncertainty model for partial information about orders: they make it possible to mix imprecise observations and belief degrees in a single setting, thus formally putting sets and probabilities under a common umbrella. One of the main contributions of this paper is to provide a framework where the incomparability relation of partial orders is explicitly modelled (in our case as a specific focal element), a feature that, as far as we know, is not present in works dealing with quantitative models of orders.

Only a few other works deal with belief functions defined over partial orders. Among them, the work of Tritchler and Lockwood (1991) is probably the oldest one, but it remains very theoretical (not providing practical inference tools) and does not explicitly model incomparability. Utkin (2009) also proposes belief functions to work with partial orders, but considers frequencies of pairwise comparisons between groups of objects (i.e., imprecise observations) and no incomparability, while we consider pairwise comparisons between single objects and model incomparability, without necessarily referring to a frequentist interpretation.

The main framework we use is presented in Sect. 3, and is then illustrated in Sect. 4 on two problems of machine learning: label ranking and multilabel classification. Section 2 recalls the necessary elements of belief function theory.

2 Belief functions

The Theory of Belief Functions (also referred to as Dempster–Shafer or Evidence Theory) was proposed by Shafer (1976) as a general model of uncertainty. By mixing probabilistic and set-valued representations, it makes it possible to represent degrees of belief and incomplete information in a unified framework, which makes it adequate to model uncertainty about orders or preferences between objects. Indeed, preferences are most often partially observed and may be subject to various uncertainties (e.g., a decision maker can be quite uncertain about her/his preferences, or the preferences of different agents may be based on quite different amounts of data).

Let us consider an uncertain variable \(\omega \) taking values in a finite and unordered set \(\Omega \) (in our case, this will be the set of asymmetric relations bearing on a set \(\Lambda =\{\lambda _1,\ldots ,\lambda _c\}\) of \(c\) objects) called the frame of discernment. Within belief function theory, the belief or the information regarding the actual value taken by \(\omega \) is represented by a mass function (Shafer 1976; Smets and Kennes 1994) defined as a function \(m^\Omega \) from \(2^\Omega \) to [0, 1], verifying

$$\begin{aligned} \sum _{A \subseteq \Omega } m^\Omega (A) = 1, \end{aligned}$$
(1)

and

$$\begin{aligned} m^\Omega (\emptyset )=0. \end{aligned}$$
(2)

The notation \(m^\Omega \) will be simplified to \(m\) when there is no ambiguity about the frame of discernment. The subsets \(A\) of \(\Omega \) such that \(m(A)>0\) are called focal sets of \(m\). Each focal set \(A\) represents a set of possible values for \(\omega \), and the quantity \(m(A)\) can be interpreted as a fraction of a unit mass of belief allocated to \(A\) on the basis of a given evidential corpus. Complete ignorance corresponds to \(m(\Omega )= 1\) (vacuous mass), and perfect knowledge of the value of \(\omega \) is represented by the certain mass assigning the whole mass of belief to a unique singleton of \(\Omega \). A mass function is said to be logical if it has only one focal set. A mass function is simple if it has at most two focal sets, including \(\Omega \).
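To make these notions concrete, here is a minimal Python sketch (our own illustration, with hypothetical names; it is not part of the formal framework) encoding a mass function as a dictionary from focal sets to masses:

```python
# A minimal sketch (not from the paper) of a mass function over a finite frame,
# represented as a dict mapping focal sets (frozensets) to masses.
def make_mass(assignment):
    """Check conditions (1)-(2): masses sum to 1 and no mass on the empty set."""
    assert all(len(A) > 0 for A in assignment), "m(emptyset) must be 0"
    assert abs(sum(assignment.values()) - 1.0) < 1e-9, "masses must sum to 1"
    return dict(assignment)

Omega = frozenset({"a", "b", "c"})
vacuous = make_mass({Omega: 1.0})                              # complete ignorance
certain = make_mass({frozenset({"a"}): 1.0})                   # perfect knowledge
simple  = make_mass({frozenset({"a", "b"}): 0.7, Omega: 0.3})  # a simple mass function
```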

Any mass function can be equivalently represented by a belief function \(\text {bel}\), a plausibility function \(\text {pl}\) and a commonality function \(\text {q}\) defined, respectively, by

$$\begin{aligned} \text {bel}(A) \triangleq \sum _{ B \subseteq A} m(B), \end{aligned}$$
(3)
$$\begin{aligned} \text {pl}(A) \triangleq \sum _{B \cap A \ne \emptyset } m(B), \end{aligned}$$
(4)
$$\begin{aligned} \text {q}(A) \triangleq \sum _{B \supseteq A} m(B), \end{aligned}$$
(5)

for all \(A \subseteq \Omega .\) The belief function quantifies how much event \(A\) is implied by the information, as it sums up masses of sets included in \(A\) and whose mass is necessarily allocated to \(A\). The plausibility function quantifies how much event \(A\) is consistent with our information, as it sums masses that do not contradict \(A\), i.e., whose intersection with \(A\) is non-empty. The commonality is harder to interpret, but possesses mathematical properties that we will exploit in this paper. The first of them is that for any singleton \(\omega \in \Omega \), we have \(\text {pl}(\{\omega \})=\text {q}(\{\omega \})\).
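These three set functions follow directly from the dictionary representation above; a sketch of Eqs. (3)–(5) (again an illustration, with helper names of our own choosing):

```python
# Sketch of Eqs. (3)-(5) for the dict-based mass functions defined above.
def bel(m, A):
    return sum(v for B, v in m.items() if B <= A)   # focal sets included in A
def pl(m, A):
    return sum(v for B, v in m.items() if B & A)    # focal sets intersecting A
def q(m, A):
    return sum(v for B, v in m.items() if A <= B)   # focal sets containing A

m = {frozenset({"a", "b"}): 0.6, frozenset({"a", "b", "c"}): 0.4}
w = frozenset({"a"})
assert pl(m, w) == q(m, w)   # pl and q coincide on singletons
```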

Two mass functions \(m_1\) and \(m_2\) on \(\Omega \) representing two distinct pieces of evidence may be combined by Dempster’s rule of combination. The resulting mass function \(m_\oplus = m_1 \oplus m_2\) is given by

$$\begin{aligned} m_\oplus (A) \triangleq \frac{1}{1-K} \sum _{B \cap C=A} m_1(B) m_2(C), \end{aligned}$$
(6)

with \(K\), called the degree of conflict, defined as the mass \(m(\emptyset )\) given to the empty set, i.e.,

$$\begin{aligned} K \triangleq \sum _{B \cap C=\emptyset } m_1(B) m_2(C). \end{aligned}$$
(7)

Dempster’s rule of combination may be equivalently expressed using commonalities as

$$\begin{aligned} q_\oplus (A) = \displaystyle {\frac{1}{1-K}} ~ \text {q}_1(A) \text {q}_2(A) \quad \forall A \subseteq \Omega , \end{aligned}$$
(8)

where \(\text {q}_1(A)\) and \(\text {q}_2(A)\) are, respectively, the commonalities associated to \(m_1\) and \(m_2\). This provides a very easy means to express the result of Dempster’s rule on events \(A \subseteq \Omega \), and we will make heavy use of it in the next sections.
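The following sketch (our own illustration) implements Dempster’s rule (6)–(7) and checks the commonality form (8) on a small example, reusing the q() helper from the previous sketch:

```python
# Sketch of Dempster's rule (6)-(7), plus a check of its commonality form (8);
# q() is the helper from the previous sketch.
from collections import defaultdict

def dempster(m1, m2):
    raw = defaultdict(float)
    for B, v1 in m1.items():
        for C, v2 in m2.items():
            raw[B & C] += v1 * v2
    K = raw.pop(frozenset(), 0.0)                  # degree of conflict (7)
    return {A: v / (1 - K) for A, v in raw.items()}, K

m1 = {frozenset({"a", "b"}): 0.8, frozenset({"a", "b", "c"}): 0.2}
m2 = {frozenset({"b", "c"}): 0.5, frozenset({"a", "b", "c"}): 0.5}
m12, K = dempster(m1, m2)
A = frozenset({"b"})
assert abs(q(m12, A) - q(m1, A) * q(m2, A) / (1 - K)) < 1e-9   # Eq. (8)
```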

3 Partial order prediction in the framework of belief functions using a pairwise approach

3.1 Problem statement, notations

Let \(\Lambda =\{\lambda _1,\ldots ,\lambda _c\}\) denote the set of \(c\) possible objects (labels, alternatives, etc.). A binary relation \(R \subseteq \Lambda \times \Lambda \) is asymmetric if \((\lambda _i, \lambda _j) \in R \Rightarrow (\lambda _j, \lambda _i) \not \in R\), and we will denote by \(\mathcal {A}\) the set of asymmetric relations defined over \(\Lambda \).

A strict partial order over \(\Lambda \) is a binary relation that is transitive and asymmetric, and we will denote by \(\mathcal {R}\) the set of such partial orders over \(\Lambda \). In this latter case \((\lambda _i, \lambda _j) \in R\), also written \(\lambda _i \succ _R \lambda _j\), indicates that \(\lambda _i\) is strictly higher than (preferred to) \(\lambda _j\). If neither \(\lambda _i \succ _R \lambda _j\) nor \(\lambda _j \succ _R \lambda _i\) holds, \(\lambda _i\) and \(\lambda _j\) are said to be incomparable, which we will denote by \(\lambda _i \prec \succ _R \lambda _j\).

A partial order \(R\) is a total (or linear) order if it satisfies the additional completeness property stating that, for every \(\lambda _i, \lambda _j \in \Lambda \), either \(\lambda _i \succ _R \lambda _j\) or \(\lambda _j \succ _R \lambda _i\) must hold. As explained in the introduction, one of the main contributions of this paper is to introduce various means to perform inferences on the space \(\mathcal {R}\), or on one of its subspaces, given that our initial information comes in the form of belief functions bearing on the pairs \(\lambda _i,\lambda _j\), \(i < j\), of labels.

Considering such a pairwise decomposition is common when dealing with orders, as the space we are working on quickly becomes huge as \(c\) increases. Indeed, with \(c=8\), there are already 431,723,379 possible partial order relations (source: On-line Encyclopedia of Integer Sequences, http://oeis.org/A001035) and \(8!=40,320\) linear orders. Directly working on \(\mathcal {R}\) would be computationally prohibitive, hence the interest in decomposing the problem into a set of simpler ones.

3.2 Combination of pairwise belief functions

For each pair \(\{\lambda _i,\lambda _j\}\), \(i < j\), we consider that the information about the relation between \(\lambda _i\) and \(\lambda _j\) provided by some source (e.g., a classifier as in Sect. 4, a decision maker, a recommendation system) is expressed using a mass function, denoted by \(m^{{\mathrm \Theta _{ij}}}\), quantifying the uncertainty related to this relation. This mass function has the following general form:

$$\begin{aligned} \left\{ \begin{array}{l} m^{{\mathrm \Theta _{ij}}}({\mathrm \lambda _i \succ \lambda _j}) = \alpha _{ij}, \\ m^{{\mathrm \Theta _{ij}}}({\mathrm \lambda _j \succ \lambda _i}) = \beta _{ij}, \\ m^{{\mathrm \Theta _{ij}}}({\mathrm \lambda _i \prec \succ \lambda _j}) = \gamma _{ij}, \\ m^{{\mathrm \Theta _{ij}}}(\mathcal {A})=1-\alpha _{ij}-\beta _{ij}-\gamma _{ij}, \end{array} \right. \end{aligned}$$
(9)

with \({\mathrm \lambda _i \succ \lambda _j},{\mathrm \lambda _j \succ \lambda _i}\), and \({\mathrm \lambda _i \prec \succ \lambda _j}\) being short notations for the events \(\{R \in \mathcal {A}: (\lambda _i, \lambda _j) \in R\}\), \(\{R \in \mathcal {A}: (\lambda _j, \lambda _i) \in R\}\), and \(\{R \in \mathcal {A}: (\lambda _i, \lambda _j) \not \in R \text{ and } (\lambda _j, \lambda _i) \not \in R \}\), respectively. Note that the above mass may give a positive weight to incomparability, which implicitly means that such incomparability can be observed. While this is a reasonable assumption in some cases, such as when decision makers are unable to compare two alternatives, or in learning problems where observations are partial orders (e.g., multilabel prediction, see Sect. 4), in others (e.g., label ranking) assuming incomparability can be observed may turn out to be unreasonable, in which case one can simply impose \(\gamma _{ij}=0\) in (9).

When the observation is the preference \({\mathrm \lambda _i \succ \lambda _j}\) with some reliability \(\alpha _{ij}\), then this can be modelled by setting \(\gamma _{ij}=0\) and \(\beta _{ij}=0\) in Expression (9). Similarly, observing the preference \({\mathrm \lambda _j \succ \lambda _i}\) with some reliability \(\beta _{ij}\) can be modelled by setting \(\gamma _{ij}=0\) and \(\alpha _{ij}=0\).
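As an illustration, the pairwise mass (9) can be encoded as follows (a sketch with symbolic tags standing for the four focal elements; in the actual framework these are the event sets defined above):

```python
# Sketch of the pairwise mass (9), with symbolic tags standing for the four
# focal elements (an illustrative simplification of the underlying event sets).
def pairwise_mass(alpha, beta, gamma=0.0):
    assert 0 <= alpha + beta + gamma <= 1
    return {"i>j": alpha, "j>i": beta, "i~j": gamma,   # "i~j": incomparability
            "A": 1 - alpha - beta - gamma}             # "A": total ignorance

m_obs = pairwise_mass(alpha=0.7, beta=0.0)  # "lambda_i > lambda_j" observed with reliability 0.7
```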

The combination of the pairwise mass functions defined by (9) may be seen as an information fusion problem and Dempster’s rule of combination may be used to this end. Applying Dempster’s rule yields

$$\begin{aligned} m^{\mathcal {A}} = m^{\Theta _{12} } \oplus m^{\Theta _{13} } \oplus \cdots \oplus m^{\Theta _{(c-1)c}}. \end{aligned}$$
(10)

This combination can be computed equivalently using the commonalities by

$$\begin{aligned} \text {q}^{\mathcal {A}} = \prod _{i<j} \text {q}^{\Theta _{ij} }. \end{aligned}$$
(11)

Since the focal elements of two distinct masses \(m^{\Theta _{ij}}\) and \(m^{\Theta _{k\ell }}\) contain information about different pairs of labels \(\{\lambda _i,\lambda _j\}\), \(\{\lambda _k,\lambda _\ell \}\), the intersections between the focal elements of \(m^{\Theta _{ij}}\) and of \(m^{\Theta _{k\ell }}\) will be non-empty, and these intersections will be sets of asymmetric relations. In particular, this also means that in our case the value of \(K\) in (6) will be zero.

Example 1

Let \(\Lambda =\{\lambda _1,\lambda _2,\lambda _3\}\). Let us consider the following focal elements, respectively, taken from \(m^{\Theta _{12}}\), \(m^{\Theta _{13}}\) and \(m^{\Theta _{23}}\)

  • \({\mathrm \lambda _{1} \succ \lambda _{2}}:=\{\{(\lambda _1,\lambda _2),(\lambda _2,\lambda _3),(\lambda _1,\lambda _3)\},\) \(\{(\lambda _1,\lambda _2),(\lambda _2,\lambda _3),(\lambda _3,\lambda _1)\}, \) \(\{(\lambda _1,\lambda _2),(\lambda _3,\lambda _2),(\lambda _1,\lambda _3)\},\) \(\{(\lambda _1,\lambda _2),(\lambda _3,\lambda _2),(\lambda _3,\lambda _1)\}, \{(\lambda _1,\lambda _2),(\lambda _3,\lambda _2)\}, \) \(\{(\lambda _1,\lambda _2),(\lambda _2,\lambda _3)\},\) \(\{(\lambda _1,\lambda _2),(\lambda _1,\lambda _3)\},\) \(\{(\lambda _1,\lambda _2),(\lambda _3,\lambda _1)\},\{(\lambda _1,\lambda _2)\} \}\)

  • \({\mathrm \lambda _{1} \prec \succ \lambda _{3}}:=\{\{(\lambda _1,\lambda _2),(\lambda _2,\lambda _3)\},\{(\lambda _1,\lambda _2),(\lambda _3,\lambda _2)\},\) \(\{(\lambda _2,\lambda _1),(\lambda _2,\lambda _3)\},\) \(\{(\lambda _2,\lambda _1),(\lambda _3,\lambda _2)\},\) \(\{(\lambda _1,\lambda _2)\},\{(\lambda _2,\lambda _1)\},\{(\lambda _3,\lambda _2)\},\{(\lambda _2,\lambda _3)\},\emptyset \}\) (the empty relation \(\emptyset \) contains neither \((\lambda _1,\lambda _3)\) nor \((\lambda _3,\lambda _1)\), hence belongs to this event)

  • \(\mathcal {A}\)

The focal element \(\phi \) resulting from the intersection of \({\mathrm \lambda _{1} \succ \lambda _{2}}\), \({\mathrm \lambda _{1} \prec \succ \lambda _{3}}\) and \(\mathcal {A}\) is the set composed of the following three asymmetric relations:

$$\begin{aligned} \phi =\{\{(\lambda _1,\lambda _2),(\lambda _2,\lambda _3)\},\{(\lambda _1,\lambda _2),(\lambda _3,\lambda _2)\},\{(\lambda _1,\lambda _2)\}\}. \end{aligned}$$

The mass of \(\phi \) resulting from the combination is

$$\begin{aligned} m^{\mathcal {A}}(\phi ) = m^{\Theta _{12}}({\mathrm \lambda _{1} \succ \lambda _{2}}) m^{\Theta _{13}}({\mathrm \lambda _{1} \prec \succ \lambda _{3}}) m^{\Theta _{23}}(\mathcal {A}). \end{aligned}$$
(12)
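Example 1 can be checked by brute force, since with three labels the space \(\mathcal {A}\) contains only 27 relations; the following sketch (our own illustration) enumerates them and recovers \(\phi \):

```python
# Sketch reconstructing Example 1 by brute force: enumerate the 27 asymmetric
# relations over three labels and intersect the three focal elements.
from itertools import product

pairs = [(1, 2), (1, 3), (2, 3)]

def asymmetric_relations():
    # For each pair (i, j): 0 -> (i, j) in R, 1 -> (j, i) in R, 2 -> neither
    for choice in product(range(3), repeat=len(pairs)):
        R = set()
        for (i, j), c in zip(pairs, choice):
            if c == 0:
                R.add((i, j))
            elif c == 1:
                R.add((j, i))
        yield frozenset(R)

A = set(asymmetric_relations())
ev_1_gt_2 = {R for R in A if (1, 2) in R}                           # lambda_1 > lambda_2
ev_1_inc_3 = {R for R in A if (1, 3) not in R and (3, 1) not in R}  # lambda_1 <> lambda_3
phi = ev_1_gt_2 & ev_1_inc_3 & A
print(sorted(map(sorted, phi)))   # the three relations listed above
```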

3.3 Inferences on partial orders

The combination (10) allocates masses to various sets of asymmetric relations defined on \(\Lambda \times \Lambda \). However, the focus of this paper is on the set \(\mathcal {R}\) of partial orders (a subset of \(\mathcal {A}\)), and we will now explain how we can go from a model on the space \(\mathcal {A}\) to inferences on \(\mathcal {R}\). From (10), finding a partial order belonging to some specific set \(\mathcal {S} \subseteq \mathcal {R}\) can be done by finding the element within \(\mathcal {S}\) that has the maximum plausibility, that is, by finding

$$\begin{aligned} \hat{\omega }=\mathop { \hbox { argmax }}\limits _{{\omega \in {\mathcal {S}}}} \text {pl}^{\mathcal {A}}(\{{\omega }\}). \end{aligned}$$
(13)

Using the maximum of plausibility on singletons has been recently proposed as an efficient decision tool in many problems involving belief functions (El Zoghby et al. 2013; Denœux and Masson 2012). It also corresponds to a maximax type of decision (thus adopting an “optimistic” criterion, Troffaes 2007) or to the decision that would be taken using the plausibility transform (Cobb and Shenoy 2006).

Rather than directly computing (13), we could first condition on the space \(\mathcal {S}\) and then find the element of the conditioned belief function with the highest plausibility. Actually, the decision given by (13) is also consistent with such an approach. Indeed, recall that conditioning on a set \(B\) with Dempster’s conditioning gives the conditioned plausibility measure \(\text {pl}(A|B)=\text {pl}(A \cap B)/\text {pl}(B)\), hence for any \(\omega ,\omega ' \in B\), if \(\text {pl}(\{\omega \}) \le \text {pl}(\{\omega '\})\), then \(\text {pl}(\{\omega \}|B) \le \text {pl}(\{\omega '\}|B)\), meaning that the ordering of singleton plausibilities before and after conditioning remains unchanged.

Using the fact that plausibilities coincide with commonalities on singletons, plausibilities of singletons can easily be expressed as a product of the plausibilities of the combined belief functions using Eq. (11). Let us consider an element \(R\) of \(\mathcal{{R}}\). As \(\{R\}\) is a singleton, we have

$$\begin{aligned} \left\{ \begin{array}{ll} \text {q}^{\Theta _{ij}}(\{R\})=1-\beta _{ij}-\gamma _{ij} &{} \quad \text{ if } \,\, \lambda _i \succ _R \lambda _j, \\ \text {q}^{\Theta _{ij}}(\{R\})=1-\alpha _{ij}-\gamma _{ij} &{} \quad \text{ if } \,\, \lambda _j \succ _R \lambda _i, \\ \text {q}^{\Theta _{ij} }(\{R\})=1-\alpha _{ij}-\beta _{ij} &{} \quad \text{ if } \,\, \lambda _j \prec \succ _R \lambda _i. \end{array} \right. \end{aligned}$$
(14)

Using (14) and (8), the commonality/plausibility of \(R\) can be written as

$$\begin{aligned} \text {q}^{\mathcal {A}}(\{R\})&=\text {pl}^{\mathcal {A}}(\{R\}) \propto \prod _{\substack{\lambda _i \succ _{R} \lambda _j \\ i<j}} (1-\beta _{ij}-\gamma _{ij})&\nonumber \\&\quad \times \prod _{\substack{\lambda _l \succ _{R} \lambda _k \\ k<l}} (1-\alpha _{kl}-\gamma _{kl}) \prod _{\substack{\lambda _{m} \prec \succ _{R} \lambda _n \\ m<n}} (1-\alpha _{mn}-\beta _{mn}).&\end{aligned}$$
(15)
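For illustration, Eq. (15) translates directly into the following sketch computing the log-plausibility of a relation \(R\) (a hypothetical data layout of our own: \(\alpha ,\beta ,\gamma \) as dicts indexed by pairs \(i<j\)):

```python
# Sketch of Eq. (15): log-plausibility of a relation R (a set of ordered pairs),
# with alpha, beta, gamma given as dicts indexed by pairs (i, j), i < j.
import math

def log_plausibility(R, alpha, beta, gamma):
    logpl = 0.0
    for (i, j) in alpha:                     # iterate over all pairs i < j
        if (i, j) in R:                      # lambda_i > lambda_j
            logpl += math.log(1 - beta[i, j] - gamma[i, j])
        elif (j, i) in R:                    # lambda_j > lambda_i
            logpl += math.log(1 - alpha[i, j] - gamma[i, j])
        else:                                # lambda_i and lambda_j incomparable
            logpl += math.log(1 - alpha[i, j] - beta[i, j])
    return logpl
```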

Given the size of the space \(\mathcal {R}\) (or of one of its subsets of interest), finding the most plausible order relation obviously cannot be done by enumeration. This is why we propose a generic approach to obtain this most plausible relation, which consists in reformulating the problem as a binary integer linear programming problem, to which state-of-the-art optimization techniques can then be applied. First, we introduce the binary variables \(r_{ij}\), \(i,j=1,\ldots ,c\), defined by

$$\begin{aligned} \left\{ \begin{array}{ll} \lambda _i \succ _R \lambda _j \iff r_{ij}=1\,\, \text{ and } \,\,r_{ji}=0, \\ \lambda _i \prec \succ _R \lambda _j \iff r_{ij}=0\,\, \text{ and } \,\,r_{ji}=0. \\ \end{array} \right. \end{aligned}$$

Let us now define, for each \(i<j\):

$$\begin{aligned} \left\{ \begin{array}{ll} X_{ij}^{(1)}=r_{ij}(1-r_{ji}), \\ X_{ij}^{(2)}=r_{ji}(1-r_{ij}), \\ X_{ij}^{(3)}=(1-r_{ij})(1-r_{ji}). \\ \end{array} \right. \end{aligned}$$

Expression (15) can be rewritten using these new binary variables as

$$\begin{aligned} \text {pl}^{\mathcal {A}}(\{R\}) \propto \prod _{i<j} (1-\beta _{ij}-\gamma _{ij})^{X_{ij}^{(1)}}(1-\alpha _{ij}-\gamma _{ij})^{X_{ij}^{(2)}}(1-\alpha _{ij}-\beta _{ij})^{X_{ij}^{(3)}}. \end{aligned}$$
(16)

Maximizing expression (16) is equivalent to maximizing its logarithm so that the most plausible relation \(\hat{\omega } \in \mathcal {A}\) may be found as the solution of the following binary integer programming problem:

$$\begin{aligned}&\max _{X_{ij}^{(k)} \in \{0,1\}} \sum _{ i<j} X_{ij}^{(1)} \ln {(1-\beta _{ij}-\gamma _{ij})}&\nonumber \\&\quad +\, \sum _{i<j} X_{ij}^{(2)} \ln {(1-\alpha _{ij}-\gamma _{ij})}&\nonumber \\&\quad +\,\, \sum _{ i<j} X_{ij}^{(3)} \ln {(1-\alpha _{ij}-\beta _{ij})}.&\end{aligned}$$
(17)

subject to

$$\begin{aligned} \sum _{k=1}^3 X_{ij}^{(k)}=1 \quad \forall i<j, \end{aligned}$$
(18)

This constraint ensures that exactly one alternative among \(\lambda _i \succ \lambda _j\), \(\lambda _j \succ \lambda _i\) and \(\lambda _i \prec \succ \lambda _j\) is chosen. Depending on the nature of the relation we want to retrieve (or condition on), additional constraints may be imposed, as we shall now detail for partial orders, linear orders and bipartite rankings.

Partial orders the property of transitivity satisfied by partial orders can be encoded by the following constraint:

$$\begin{aligned} r_{ij} + r_{jk} - 1 \le r_{ik } \quad \forall i,j,k \text { distinct}, \end{aligned}$$
(19)

that has to be added to (18) to ensure that the solution of (17) will be a partial order. It can be expressed using the variables of the problem since the following equations hold:

$$\begin{aligned} r_{ij}= \left\{ \begin{aligned} 1-X_{ij}^{(2)}-X_{ij}^{(3)}&\quad \text{ if } \,\, i<j, \\ 1-X_{ji}^{(1)}-X_{ji}^{(3)}&\quad \text{ if } \,\, i>j. \end{aligned} \right. \end{aligned}$$
(20)
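As an illustration of the semantics of (17)–(19), the following brute-force sketch (our own illustration, usable only for small \(c\), e.g., to check the solutions of an actual ILP solver) searches for the most plausible partial order by enumeration:

```python
# A brute-force counterpart of program (17)-(19) for small c (a sketch useful
# for checking ILP solutions; realistic instances need an actual solver).
from itertools import product, permutations
import math

def most_plausible_partial_order(c, alpha, beta, gamma):
    pairs = [(i, j) for i in range(c) for j in range(i + 1, c)]
    best, best_val = None, -math.inf
    for choice in product(range(3), repeat=len(pairs)):   # X^(1), X^(2), X^(3)
        R = set()
        for (i, j), k in zip(pairs, choice):
            if k == 0:
                R.add((i, j))
            elif k == 1:
                R.add((j, i))
        # transitivity constraint (19)
        if any((i, j) in R and (j, k) in R and (i, k) not in R
               for i, j, k in permutations(range(c), 3)):
            continue
        val = sum(math.log(1 - beta[i, j] - gamma[i, j]) if (i, j) in R else
                  math.log(1 - alpha[i, j] - gamma[i, j]) if (j, i) in R else
                  math.log(1 - alpha[i, j] - beta[i, j])
                  for (i, j) in pairs)
        if val > best_val:
            best, best_val = R, val
    return best
```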

Linear orders a linear order is a partial order without incomparability. Basically, there are two ways within our framework to obtain linear orders: either set \(\gamma _{ij}=0\) for all \(i < j\) in (9) and solve (17) under constraints (18)–(19), so that partial orders with incomparabilities cannot have the highest plausibility, or simply add the constraint

$$\begin{aligned} X_{ij}^{(3)}=0 \quad \forall i<j \end{aligned}$$
(21)

stating that \(\lambda _i,\lambda _j\) cannot be incomparable.

Bipartite rankings a (partial) bipartite ranking consists in dividing the objects or labels into two subsets: the preferred ones and the non-preferred ones. In practice, this means that there cannot be three objects \(\lambda _i,\lambda _j,\lambda _k\) such that \(\lambda _i \succ \lambda _j \succ \lambda _k\), as there are only two subsets. Such partial orders are at work, e.g., in multilabel classification problems. This can be encoded by adding the constraints

$$\begin{aligned} r_{ij} + r_{jk} \le 1 \quad \forall i,j,k. \end{aligned}$$
(22)

to constraint (18). Multipartite ranking, where \(K\) ordered subsets \(L_1,\ldots ,L_K\) of objects must be constructed, can be modelled by extending constraints (22) to more than triplets of objects and by adding transitivity constraints.

Partial orders, linear orders and multipartite rankings are only some examples illustrating the flexibility of our framework, since many constraints are very easy to formulate within it. For instance, it is straightforward to impose some (known) relation to hold, simply by constraining the variables \(X_{ij}^{(k)}\) to adequate values. It would also be interesting to investigate if some specific families of partial orders, such as interval orders or semi-orders, can be easily expressed through constraints.

4 Applications

We now illustrate how our approach can be applied to some well-known and difficult machine learning problems, namely label ranking and multilabel classification. These kinds of problems find applications in various domains such as music or text categorization, bioinformatics, semantic classification of images and recommendation systems (Boutell et al. 2004; Elisseeff and Weston 2001; Li and Ogihara 2006; Ueda and Saito 2002, ...). In particular, we explore how partial rankings can be predicted for the label ranking problem using our approach, and the interest of explicitly modelling incomparability in the multilabel problem.

In traditional (single-label) classification, it is assumed that each observation \({\mathbf x}\) of an input space \(\mathcal{{X}}\) is associated with a single label \(\lambda \) from a finite set \(\Lambda =\{\lambda _1,\ldots ,\lambda _c\}\). The task is then to learn the mapping or classifier from \(\mathcal{{X}}\) to \(\Lambda \) using a training set of \(n\) observations \(({\mathbf x}_i,y_i)\) with \({\mathbf x}_i \in \mathcal{{X}}\) and \(y_i \in \Lambda \). Label ranking and multilabel settings differ from the traditional one in that each observation is assumed to be associated with a linear order (possibly imprecisely observed) in the case of label ranking, and with a set of relevant labels defining a bipartite ranking in the case of multilabel prediction.

4.1 Label ranking

In the label ranking setting (Vembu and Gärtner 2011), each instance \({\mathbf x}_i\) is associated with a linear order \(R_i\) over \(\Lambda \), possibly partially observed. Rather than training one classifier over all possible orderings (whose number is \(c!\)), a common strategy is to decompose the problem into pairwise preferences and to train \(c(c-1)/2\) classifiers, one for each pair of labels (Hüllermeier et al. 2008). Other approaches propose to estimate the mapping between instances and complete or partial rankings by learning a utility function for each label (e.g., by constraint classification or log-linear models), from which a ranking can then be deduced (Dekel et al. 2003; Har-Peled et al. 2003).

The pairwise decomposition applied to an illustrative data set is shown in Fig. 1. From this figure, one can see that this decomposition tackles the problem of missing data in a straightforward way, as missing pairwise preferences simply mean less data in the corresponding split.

Fig. 1 Pairwise decomposition in the case of label ranking

In the usual label ranking setting, the goal is to predict a linear order as close as possible to the observed one (note that, in the case of label ranking, the notion of closeness can be modelled by various loss functions, Hüllermeier et al. 2008). However, some authors have recently discussed the interest of producing not linear orders, but partial orders as predictions (Cheng et al. 2010). The idea behind such predictions is close to that of the reject option (Chow 1970) in traditional classifiers, or to credal classification (Zaffalon 2002), that is, to abstain from making precise predictions when the available information is insufficient.

As our framework is tailored to infer partial orders, we first apply it in this context of label ranking with partial abstention and compare it to the approach of Cheng et al. (2010), first considered by Rademaker and De Baets (2010) in the general setting of partial order aggregation. The method of Cheng et al. is based on the principle of classifier ensembles. Instead of training a single model, the idea is to train \(k\) binary probabilistic classifiers by creating \(k\) bootstrap samples from the original data. In this way, \(k\) estimates \(\hat{p}_{ij}^\ell \), \(\ell =1,\ldots ,k\), of the probability \({\mathbb {P}}(\lambda _i \succ \lambda _j)\) are available. For each pair of labels \(\{\lambda _i,\lambda _j\}\), a preference degree \(P(\lambda _i,\lambda _j)\) is defined as the fraction of classifiers for which \(\lambda _i\) is preferred to \(\lambda _j\):

$$\begin{aligned} P(\lambda _i,\lambda _j)=\frac{1}{k} |\{\ell :\hat{p}_{ij}^\ell > 0.5 \}|. \end{aligned}$$
(23)

Then, a binary relation is derived by thresholding the preference degrees:

$$\begin{aligned} R_\alpha =\{(\lambda _i,\lambda _j): P(\lambda _i,\lambda _j) \ge \alpha \}, \end{aligned}$$
(24)

where the threshold \(\alpha \) is such that \(\alpha >\frac{\lfloor k/2\rfloor }{k}\), with \(\lfloor x \rfloor \) denoting the integer part of \(x\). Note that, by construction, the number of possibly different values in \(P\), and consequently the number of possible thresholds, is directly related to the number of bootstrap replicates. The authors underline that, for a given threshold, especially if it is low, the obtained relation is not necessarily transitive and may even contain cycles. Transitivity can easily be enforced by computing the transitive closure, but guaranteeing the absence of cycles is more problematic. The authors propose a procedure to find the minimum value of \(\alpha \) such that the transitive closure of the partial relation is a strict partial order relation. By varying the value of the threshold, different relations are obtained: the larger the \(\alpha \) value, the less informative the corresponding relation (i.e., the more incomparabilities it contains).
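For illustration, Eqs. (23)–(24) amount to the following sketch (a hypothetical data layout of our own: p_hat maps each pair \(i<j\) to its \(k\) bootstrap estimates):

```python
# Sketch of Eqs. (23)-(24): p_hat[(i, j)] holds the k bootstrap estimates of
# P(lambda_i > lambda_j) for each pair i < j (hypothetical data layout).
def thresholded_relation(p_hat, k, alpha):
    R = set()
    for (i, j), ps in p_hat.items():
        if sum(p > 0.5 for p in ps) / k >= alpha:
            R.add((i, j))
        if sum(p < 0.5 for p in ps) / k >= alpha:
            R.add((j, i))
    return R   # may fail to be transitive; Cheng et al. then enlarge alpha
```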

To compare our approach with that of Cheng et al. (2010), we use the following procedure:

  1. we build \(k\) bootstrap replicates of the learning set;

  2. from each bootstrap sample, we train the \(c(c-1)/2\) pairwise classifiers;

  3. for a fixed threshold \(\tau \), we compute \(m^{{\mathrm \Theta _{ij}}}\) as follows:

    $$\begin{aligned} \left\{ \begin{array}{l} m^{{\mathrm \Theta _{ij}}}({\mathrm \lambda _i \succ \lambda _j}) = \frac{1}{k} |\{l: \hat{p}_{ij}^l > 0.5 + \tau \}|, \\ m^{{\mathrm \Theta _{ij}}}({\mathrm \lambda _j \succ \lambda _i}) = \frac{1}{k} |\{l: \hat{p}_{ij}^l < 0.5 - \tau \}|, \\ m^{{\mathrm \Theta _{ij}}}({\mathrm \lambda _i \prec \succ \lambda _j}) = \frac{1}{k} |\{l: 0.5-\tau <\hat{p}_{ij}^l < 0.5 + \tau \}|, \\ \end{array} \right. \end{aligned}$$
    (25)

    where it is clear that the higher \(\tau \) is, the more incomparabilities are favoured.

  4. To take into account the different performances of the classifiers, a discounting operation (Shafer 1976) is applied to each mass \(m^{{\mathrm \Theta _{ij}}}\) (a code sketch of steps 3 and 4 is given after this list). For a given mass, let \(\epsilon \in [0,1]\) be the average of the \(k\) values of mean squared error obtained on the training set; then \(m^{{\mathrm \Theta _{ij}}}\) is transformed into \(^\epsilon m^{{\mathrm \Theta _{ij}}}\) such that

    $$\begin{aligned} \left\{ \begin{array}{l} ^\epsilon m^{{\mathrm \Theta _{ij}}}(A)=(1-\epsilon )m(A) \quad \forall A \ne \mathcal {A}, \\ ^\epsilon m^{{\mathrm \Theta _{ij}}}(\mathcal {A})=(1-\epsilon )m(\mathcal {A})+\epsilon , \\ \end{array} \right. \end{aligned}$$
    (26)

    The coefficient \(\epsilon \) is usually interpreted as the unreliability of the source of information (Smets 1993): when \(\epsilon \) is equal to 1, \(m\) becomes the vacuous mass function, and when \(\epsilon =0\) the information is fully reliable and \(m\) remains unchanged.

  5. The most plausible partial order is computed by solving problem (17) under constraints (18)–(19). By varying the value of \(\tau \) (\(\tau \in [0,0.5)\)), different partial orders are obtained: the larger the \(\tau \) value, the less informative the corresponding relation.
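As announced in step 4, here is a sketch of steps 3 and 4 (our own illustration; the tags stand for the focal elements of (9), and eps is the error rate \(\epsilon \) assumed given):

```python
# Sketch of steps 3-4: the mass (25) built from the bootstrap estimates ps of
# one pair, and its discounted version (26) given an error rate eps.
def bootstrap_mass(ps, tau):
    k = len(ps)
    a = sum(p > 0.5 + tau for p in ps) / k
    b = sum(p < 0.5 - tau for p in ps) / k
    g = sum(0.5 - tau < p < 0.5 + tau for p in ps) / k
    return {"i>j": a, "j>i": b, "i~j": g, "A": 1 - a - b - g}

def discount(m, eps):
    out = {A: (1 - eps) * v for A, v in m.items()}
    out["A"] = out.get("A", 0.0) + eps   # remaining mass goes to ignorance
    return out
```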

We applied this strategy to eight data sets presented in Hüllermeier et al. (2008). The first six data sets are multiclass data sets from the UCI machine learning repository (Bache and Lichman 2013) that have been transformed into label ranking data by a procedure proposed in Hüllermeier et al. (2008): a naive Bayes classifier was first trained on the entire data set, and then, for each instance, the labels were ordered according to the predicted class probabilities. The last two data sets (cold and heat) are real-world data sets originating from the bioinformatics field and are described in Hüllermeier et al. (2008). All these data sets are described in Table 1.

Table 1 Data sets description for label ranking

To evaluate the performance of the methods, we use two measures proposed in Cheng et al. (2010). The first one, correctness, quantifies how well the predicted (partial) ranking matches the observed ranking, whereas the second one is intended to measure the degree of completeness of the relation. A good method predicting partial orders should see its correctness increase as the completeness decreases, and there is usually a trade-off to find between these two criteria. They can be formally defined as follows: let \(({\mathbf x}_i,L_i)\), \(i=1,\ldots ,n\), denote the test set, where \(L_i\) is the true linear order relation on \(\Lambda \times \Lambda \), and let \(R_i\) denote the partial (or linear) order computed by the method. For each \({\mathbf x}_i\), a pair of labels \(\{\lambda _k,\lambda _l\}\) is said to be concordant if

$$\begin{aligned} ((\lambda _k,\lambda _l) \in L_i \,\, \text{ and } \,\, (\lambda _k,\lambda _l) \in R_i) \,\, \text{ or } \,\, ((\lambda _l,\lambda _k) \in L_i \,\, \text{ and } \,\, (\lambda _l,\lambda _k) \in R_i). \end{aligned}$$

It is said to be discordant if

$$\begin{aligned} ((\lambda _k,\lambda _l) \in L_i \,\, \text{ and } \,\, (\lambda _l,\lambda _k) \in R_i) \,\, \text{ or } \,\, ((\lambda _l,\lambda _k) \in L_i \,\, \text{ and } \,\, (\lambda _k,\lambda _l) \in R_i). \end{aligned}$$

Let \(c_i\) and \(d_i\) denote the number of concordant and discordant pairs of alternatives for sample \(({\mathbf x}_i,L_i)\), respectively. The correctness measure for the test set is defined as

$$\begin{aligned} \mathrm{Correctness} = \frac{1}{n} \sum _{i=1}^n \frac{c_i-d_i}{c_i+d_i}, \end{aligned}$$

whereas the completeness is defined by

$$\begin{aligned} \mathrm{Completeness} = \frac{1}{n} \sum _{i=1}^n \frac{c_i+d_i}{c(c-1)/2}. \end{aligned}$$
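Both measures translate into the following sketch (our own illustration), where the true linear order and the predicted partial order are given as sets of ordered pairs:

```python
# Sketch of the two measures: tests is a list of (L, R) with L the true linear
# order and R the predicted partial order, both as sets of ordered pairs.
def correctness_completeness(tests, c):
    corr = comp = 0.0
    for L, R in tests:
        con = sum(p in L for p in R)              # concordant pairs
        dis = sum((b, a) in L for (a, b) in R)    # discordant pairs
        if con + dis:
            corr += (con - dis) / (con + dis)
        comp += (con + dis) / (c * (c - 1) / 2)
    n = len(tests)
    return corr / n, comp / n
```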

The performances of our method were compared to those of Cheng et al. (2010). For both methods, the base learner was a logistic regression. The number of bootstrap replicates was fixed to 11. The results reported in Fig. 2 are the mean values computed over five repetitions of a tenfold cross-validation procedure. As already explained, for some instances, Cheng et al.'s method fails to find an acyclic relation for low values of the threshold, unlike our method, which is able to find a solution in any situation. As an illustration, Table 2 gives the average rate of non-responses of Cheng et al.'s method, computed over five repetitions for different data sets with \(\alpha =0.5\). The instances for which no relation is found are usually complex cases. So, for the comparison to be fair, completeness and correctness are computed only over instances for which a relation is found by both methods.

Fig. 2 Correctness versus completeness for different data sets (dashed line, triangles: Cheng et al.'s method; solid line, squares: our method)

Table 2 Non-response rate for Cheng et al.'s method (\(\alpha =0.5\))

Our results and the results of Cheng et al. are represented, respectively, by solid and dashed lines. From these experiments, the following conclusions may be drawn:

  1. As expected, for both methods, partial abstention leads to improved correctness, that is, abstention occurs on poorly reliable predictions. For some data sets, a significant gain in performance can be achieved without losing too much completeness.

  2. The performances of the two methods are very similar for large values of completeness, but it can be seen that our approach (1) usually provides better correctness results when completeness decreases and (2) is able to span a wider range of completeness values, as it can go from a linear order to the vacuous one.

Another, perhaps surprising, observation is that providing correct predictions seems much more difficult on the genuine label ranking data sets (on which our method clearly performs better when allowing for partial orders) than on the synthetic label ranking data sets. This suggests that the synthetic data sets are much more regular than genuine label ranking problems, in which case it appears important to consider both kinds of data sets when assessing label ranking methods.

Note that we have checked that always giving a response does not dramatically decrease the performance of our method. For example, when looking for a complete relation, the correctness degree for the iris data set is 0.88 when computed over the entire data set instead of 0.90, and 0.86 instead of 0.88 for the glass data set.

4.2 Multilabel classification

In the multilabel setting (Tsoumakas et al. 2010), each instance can be associated simultaneously with one or several labels. This association implicitly defines a bipartition of the labels into relevant and irrelevant labels. The task we consider in this section is thus to learn a mapping from \(\mathcal{{X}}\) to the powerset of \(\Lambda \) from the training data \(({\mathbf x}_i,Y_i)\), \(i=1,\ldots ,n\) with \({\mathbf x}_i\) in \(\mathcal{{X}}\) and \(Y_i \subseteq \Lambda \).

In recent years, many approaches have been suggested to solve this problem. A review of these methods can be found in Tsoumakas and Katakis (2007) and Tsoumakas et al. (2010). Also, Madjarov et al. (2012) have proposed an extensive experimental comparison of the methods. In their paper, the performances of 12 methods are studied according to different evaluation measures/loss functions. The methods are divided into three categories: algorithm adaptation, problem transformation and ensemble methods. Algorithm adaptation methods extend existing machine learning algorithms to the problem of multi-label classification. Problem transformation methods transform the multi-label problem into one or more single-label problems, like the pairwise approach used in this paper. Ensemble methods use problem transformation or algorithm adaptation methods as base classifiers. A summary of all these methods is given in Table 3.

Table 3 Methods for multilabel classification

Classically, the transformation of the multilabel problem into an ordering problem somehow assumes that observations \(Y_i\) are incomplete observations of complete rankings (Fürnkranz et al. 2008). The task is then to learn how to predict complete rankings out of these observations, and a multilabel prediction is then obtained by “cutting off” the order between relevant and irrelevant labels.

Our approach is different and, as far as we know, has not been proposed before in a multilabel setting: we try to predict not a complete order, but directly a partial order corresponding to a bipartite ranking. In particular, this means that we include incomparabilities as observations. More precisely, in the training set, incomparability between labels \(\lambda _j\) and \(\lambda _k\) will be observed in \(Y_i\) if either both \(\lambda _j, \lambda _k \in Y_i\) or both \(\lambda _j, \lambda _k \not \in Y_i\); otherwise, \(\lambda _j \succ \lambda _k\) is observed when \(\lambda _j \in Y_i\) and \(\lambda _k \not \in Y_i\). A noticeable difference with viewing multilabel observations as incompletely observed linear orderings is that, in our case, if all multilabel observations are precise, all pairwise training data sets contain as many data as the original data set, that is, there are no “missing” preferences (Fig. 3).
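This decomposition is easily made explicit; a sketch (our own illustration, with hypothetical label names):

```python
# Sketch of the decomposition used here: one pairwise outcome per label pair,
# including incomparabilities (label names are hypothetical).
def pairwise_outcomes(Y, labels):
    out = {}
    for a in range(len(labels)):
        for b in range(a + 1, len(labels)):
            i, j = labels[a], labels[b]
            if (i in Y) == (j in Y):
                out[i, j] = "i~j"           # both relevant or both irrelevant
            elif i in Y:
                out[i, j] = "i>j"
            else:
                out[i, j] = "j>i"
    return out

print(pairwise_outcomes({"l1", "l3"}, ["l1", "l2", "l3"]))
# {('l1', 'l2'): 'i>j', ('l1', 'l3'): 'i~j', ('l2', 'l3'): 'j>i'}
```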

Fig. 3 Pairwise decomposition in the case of multilabel classification

We compare the results given in Madjarov et al. (2012) with the results obtained with our method using three usual data sets: emotions, scene and yeast, which are described in Table 4. The performances of the methods are evaluated using four popular evaluation criteria for multi-label classification: the ranking loss, the Hamming loss, the \(F_1\)-measure and the accuracy. Let \(({\mathbf x}_i,Y_i)\), \(i=1,\ldots ,n\), \(Y_i \subseteq \Lambda \), denote the set of test samples, and let \(\widehat{Y_i}\) denote the set of predicted labels for instance \({\mathbf x}_i\). The evaluation criteria are defined as follows:

Table 4 Data sets description for multilabel classification

The Hamming loss evaluates how many times a label is misclassified, i.e., a label not belonging to the instance is predicted or a label belonging to the instance is not predicted. This measure has to be minimized (performance is perfect when \(\mathrm{Hamming\,\, loss}=0\)) and is defined by

$$\begin{aligned} \mathrm{Hamming\,\, loss} = \frac{1}{n} \sum _{i=1}^n \frac{|Y_i \Delta \widehat{Y_i}|}{c}, \end{aligned}$$
(27)

where \(\Delta \) stands for the symmetric difference between two sets and \(c\) for the number of labels.

The accuracy is the average over the test samples of the Jaccard coefficient between the sets \(Y_i\) and \(\widehat{Y_i}\); it has to be maximized:

$$\begin{aligned} \mathrm{Accuracy} = \frac{1}{n} \sum _{i=1}^n \frac{|Y_i \cap \widehat{Y_i}|}{|Y_i \cup \widehat{Y_i}|} \end{aligned}$$
(28)

\(F_1\) score is the harmonic mean between two other evaluation measures, namely precision and recall, and is computed as follows:

$$\begin{aligned} F_1 = \frac{1}{n} \sum _{i=1}^n \frac{2 |Y_i \cap \widehat{Y_i}|}{|Y_i|+|\widehat{Y_i}|} \end{aligned}$$
(29)

The best value for \(F_1\) is 1 and the worst is 0. The ranking loss, which has to be minimized, evaluates the average fraction of pairs of labels that are reversely ordered:

$$\begin{aligned} \mathrm{Ranking\,\, loss} = \frac{1}{n} \sum _{i=1}^n \frac{|D_i|}{|Y_i||\overline{Y_i}|}, \end{aligned}$$
(30)

where \(\overline{Y_i}=\Lambda \setminus Y_i\) and \(D_i=\{(\lambda _k,\lambda _l) \mid (\lambda _k,\lambda _l) \in Y_i\times \overline{Y_i} \text{ and } (\lambda _k,\lambda _l) \in \overline{\widehat{Y_i}}\times \widehat{Y_i}\}\). Note that this \(D_i\) is formally equivalent to the number of discordant pairs used for evaluating label ranking in Sect. 4.1, but is here used in a different way (in particular, there is no need to split between correctness and completeness, as all predictions are complete).
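For reference, the four measures translate into the following per-instance sketch (our own illustration; the ranking loss uses the normalization of (30)):

```python
# Sketch of the four measures (27)-(30) for a single test instance, with Y the
# true label set, Yh the predicted one, and Lam the full label set.
def hamming(Y, Yh, Lam):
    return len(Y ^ Yh) / len(Lam)

def accuracy(Y, Yh):
    return len(Y & Yh) / len(Y | Yh) if Y | Yh else 1.0

def f1(Y, Yh):
    return 2 * len(Y & Yh) / (len(Y) + len(Yh)) if Y or Yh else 1.0

def ranking_loss(Y, Yh, Lam):
    Yc = Lam - Y
    D = sum(a not in Yh and b in Yh for a in Y for b in Yc)   # reversed pairs
    return D / (len(Y) * len(Yc)) if Y and Yc else 0.0
```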

As binary classifiers, we used the evidential \(k\)NN method (Denœux 1995) with three classes (\(\lambda _{i} \succ \lambda _{j}\), \(\lambda _{j} \succ \lambda _{i}\), \(\lambda _{i} \prec \succ \lambda _{j}\)), which directly provides a mass function of the form (9). The optimum number of nearest neighbours (the same for all classifiers) is determined using five repetitions of a tenfold cross-validation procedure on the learning set. This optimum number is then used to evaluate the method on the test set.

Tables 5, 6 and 7 summarize our results on the various data sets in a synthetic way. As we can see, this approach is worth exploring further, as we were able to obtain very good results on some data sets: we are consistently better on the Emotions data set, and we perform very well on the Yeast data set for both ranking loss and accuracy. Performances for Scene are worse, but this could be explained by the fact that \(k\)NN approaches can perform poorly when a high number of features is used.

Table 5 Results for the Scene dataset
Table 6 Results for the Yeast dataset
Table 7 Results for the Emotions dataset

Roughly speaking, as our approach is based on pairwise information between labels, we can expect it to perform better on measures related to pairwise information, such as the ranking loss (this is indeed the case for the Yeast data set). Exploring in detail the relation between the current methods and the minimization of some given loss function is beyond the scope of the current paper, whose main purpose remains making inferences about partial orders in a general setting (not necessarily machine learning); it would nevertheless be an interesting study. In any case, the empirical results show good performances.

5 Conclusion

In this paper, we have introduced a general uncertainty model, based on belief functions, to deal with partial orders in a pairwise way. Information on each pair of objects (alternatives, labels, ...) regarding ordering or incomparability is modelled by a belief function, and the belief functions of the different pairs are then combined through Dempster’s rule. We have then provided means to perform various efficient inference tasks based on the principle of maximal plausibility and on the use of integer linear programming.

The application of the approach was then illustrated on two machine learning tasks involving partial orders, namely partial predictions in label ranking and bipartite ranking prediction in multilabel problems. In both cases, our approach proved to be competitive with other algorithms, thus demonstrating its potential interest.

It is worth mentioning that, thanks to the flexibility of our framework, we can in principle predict any partial order. This is in contrast with other label ranking methods producing partial orders, which can only make predictions that belong to specific families of partial orders, such as semi-orders (Cheng et al. 2012) or interval orders (Destercke 2013).

Given the flexibility of the present approach, it would be interesting to study its connections with other areas such as multi-criteria decision making (for example, we could try to describe the set of orders representable by popular models such as CP-nets through constraints). Similarly, it may be worthwhile to explore other potential application domains involving orders, for example object ranking (Kamishima et al. 2011).