1 Introduction

Learning and estimating probabilistic models over rankings of objects has received attention for a long time: early works can be traced back at least to the 1920s [21]. Recently, this problem has seen a revival, in particular owing to the growing interest of the machine learning community in the issue [12]. Popular approaches range from associating a random utility with each object to be ranked, from which a distribution over rankings is derived [3], to directly defining a parametric distribution over the set of rankings [19].

Multiple reasons motivate making cautious inferences of ranking models. The information at hand may be scarce (as is typically the case in the cold-start problem of a recommender system) or partial (for instance because only partial rankings are observed, e.g., pairwise comparisons or top-k items). In addition, since inferring a ranking model is difficult and therefore prone to uncertainty, it may be useful to output partial rankings as predictions, thus abstaining from predicting when information is unreliable.

Imprecise probability theory is a mathematical framework in which partial estimates are formalized as sets of probability distributions. It is therefore well suited to making cautious inferences and to addressing the aforementioned problems; yet, to our knowledge, it has not been applied to ranking models so far.

In this paper, we use the imprecise probabilistic framework to infer an imprecise Plackett–Luce model, which is a specific parametric model over rankings, from data. We present the model in Sect. 2. We address its inference in Sect. 3, showing that for this specific parametric model, efficient methods can be developed to make cautious inferences based on sets of parameters. Section 4 then presents a direct application to label ranking, where we use relative likelihoods [5] to proceed with imprecise model estimation.

2 Imprecise Plackett–Luce Models

In this paper, we consider the problem of estimating a probabilistic ranking model over a set of objects or labels \({\Lambda }=\{{\lambda _{1}},\ldots ,{\lambda _{n}}\}\). This model defines probabilities over total orders on the labels—that is, complete, transitive, and asymmetric relations \(\succ \) on \({\Lambda }\). Any complete order \(\succ \) over the labels can be identified with its induced permutation or label ranking \({\tau _{}}\), that is, the unique permutation of \({\Lambda }\) such that

$$\begin{aligned} {\lambda _{{\tau _{}}(1)}}\succ {\lambda _{{\tau _{}}(2)}}\succ \dots \succ {\lambda _{{\tau _{}}(n)}}. \end{aligned}$$

We will use the terms “order on the labels”, “ranking” and “permutation” interchangeably. We denote by \({\mathcal {L}}\) the set of all \(n!\) permutations of \({\Lambda }\), and denote a generic permutation by \({\tau _{}}\).

We focus on the particular probability model known as the Plackett–Luce (PL) model [6, 13]. It is parametrised by \(n\) strictly positive parameters or strengths \(v_{1}\), ..., \(v_{n}\). The strength vector completely specifies the PL model. For any such vector, an arbitrary ranking \({\tau _{}}\) in \({\mathcal {L}}\) is assigned probability

$$\begin{aligned} P_{v_{}}({\tau _{}})=\prod _{m=1}^{n}\frac{v_{{\tau _{}}(m)}}{\sum _{j=m}^{n}v_{{\tau _{}}(j)}}. \end{aligned}$$
(1)

Clearly, the parameters \(v_{1}\), ..., \(v_{n}\) are defined up to a common positive multiplicative constant, so it is customary to assume that \(\sum _{k=1}^nv_{k}=1\). The parameter \(v_{}=(v_{1},\dots ,v_{n})\) can therefore be regarded as an element of the interior of the \(n\)-simplex.

This model has the following nice interpretation: the larger the weight \(v_{i}\), the more preferred the label \({\lambda _{i}}\). The probability that \({\lambda _{i}}\) is ranked first is

$$\begin{aligned} \frac{v_{i}}{\sum _{k=1}^nv_{k}}=v_{i}; \end{aligned}$$

conditioning on \({\lambda _{i}}\) being the first label, the probability that \({\lambda _{j}}\) is ranked second (i.e. first among the remaining labels) is equal to \(v_{j}/\sum _{k=1,k\ne i}^nv_{k}\). This reasoning can be repeated for each of the labels in a ranking. As a consequence, given a PL model defined by \(v_{}\), finding the “best” (most probable) ranking amounts to finding the permutation \({\tau _{}}_{v_{}}^\star \) which ranks the strengths in decreasing order:

$$\begin{aligned} {\tau _{}}_{v_{}}^\star \in \arg \max _{{\tau _{}}\in {\mathcal {L}}}P_{v_{}}({\tau _{}}), \quad \text {i.e.}\quad v_{{\tau _{}}_{v_{}}^\star (1)}\ge v_{{\tau _{}}_{v_{}}^\star (2)}\ge \dots \ge v_{{\tau _{}}_{v_{}}^\star (n)}. \end{aligned}$$
(2)
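
To make Eqs. (1) and (2) concrete, here is a minimal Python sketch (illustrative only, not part of the original paper) computing the probability of a ranking and the most probable ranking for a hypothetical strength vector:

```python
# Minimal sketch of Eqs. (1) and (2); the strength vector below is hypothetical.

def pl_probability(tau, v):
    """Probability of ranking `tau` (a tuple of label indices, best first)
    under a Plackett-Luce model with strengths `v` (assumed to sum to 1)."""
    prob = 1.0
    for m in range(len(tau)):
        remaining = sum(v[j] for j in tau[m:])  # strengths of labels still unranked
        prob *= v[tau[m]] / remaining
    return prob

def best_ranking(v):
    """Most probable ranking (Eq. (2)): labels sorted by decreasing strength."""
    return tuple(sorted(range(len(v)), key=lambda k: v[k], reverse=True))

v = [0.5, 0.1, 0.25, 0.15]               # hypothetical normalised strengths
print(pl_probability((0, 2, 3, 1), v))   # ~0.15
print(best_ranking(v))                   # (0, 2, 3, 1)
```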

We obtain an imprecise Plackett–Luce (IPL) model by letting the strengths vary over a subset \(\Theta \) of the interior of the simplex. Based on this subset of admissible strengths, we can compute the lower and upper probabilities of a ranking \({\tau _{}}\) as

$$\begin{aligned} \underline{P}({\tau _{}})=\inf _{v_{}\in \Theta }P_{v_{}}({\tau _{}})\quad \text {and}\quad \overline{P}({\tau _{}})=\sup _{v_{}\in \Theta }P_{v_{}}({\tau _{}}). \end{aligned}$$

The above notion of “best” ranking becomes ambiguous for an IPL model, since two vectors \(v_{},u_{}\in \Theta \) might be associated with different “best” rankings \({\tau _{}}_{v_{}}^\star \ne {\tau _{}}_{u_{}}^\star \).

Therefore, we consider two common ways to extend (2). The first one, (Walley–Sen) maximality [22, 23], considers that \({\tau _{}}_1\) dominates \({\tau _{}}_2\) (noted \({\tau _{}}_1 \succ _M {\tau _{}}_2\)) if it is more probable for every \(v_{}\in \Theta \):

$$\begin{aligned} {\tau _{}}_1 \succ _M {\tau _{}}_2 \iff P_{v_{}}({\tau _{}}_1)>P_{v_{}}({\tau _{}}_2)\quad \text {for all } v_{}\in \Theta . \end{aligned}$$
(3)

The set \(\mathcal {M}_\Theta \) of maximal rankings is composed of all such undominated rankings:

$$\begin{aligned} \mathcal {M}_\Theta =\{{\tau _{}}\in {\mathcal {L}}: \not \exists \,\tau '_{}\in {\mathcal {L}}\text { such that }\tau '_{}\succ _M{\tau _{}}\}. \end{aligned}$$
(4)

We may have \({\vert }\mathcal {M}_\Theta {\vert } >1\) when \(\Theta \) is imprecise.

The second one is E-admissibility  [18]. A ranking \({\tau _{}}\) is E-admissible if it is the “best”, according to Eq. (2), for some \(v_{}\in \Theta \). The set \(\mathcal {E}_\Theta \) of all E-admissible rankings is then

$$\begin{aligned} \mathcal {E}_\Theta =\{{\tau _{}}\in {\mathcal {L}}: \exists v_{}\in \Theta \text { such that } {\tau _{}}={\tau _{}}_{v_{}}^\star \}. \end{aligned}$$
(5)

By comparing Eqs. (4) and (5), we immediately find that \(\mathcal {E}_\Theta \subseteq \mathcal {M}_\Theta \).

3 Learning an Imprecise Plackett–Luce Model

We introduce here two methods for inferring an IPL model. The first one (Sect. 3.1), which does not make further assumptions about \(\Theta \), provides an outer approximation of the set of all maximal rankings. The second one (Sect. 3.2) computes the set of E-admissible rankings via an exact and efficient algorithm, provided that the set of strengths \(\Theta \) has the form of probability intervals.

3.1 General Case

Section 2 shows that the “best” ranking is found using Eq. (2). In the case of an IPL model, making robust and imprecise predictions requires comparing candidate rankings in a pairwise manner over the \(n!\) elements of \({\mathcal {L}}\)—generally infeasible in practice. However, checking maximality can be simplified. Notice that the numerator of Eq. (1) does not depend on \({\tau _{}}\) (the product terms can be arranged in any order). Hence, when comparing two permutations \({\tau _{}}\) and \(\tau '_{}\) using Eq. (3), only the denominators matter: indeed, \({\tau _{}}\succ _M\tau '_{}\) iff, for all \(v_{}\in \Theta \),

$$\begin{aligned} \frac{P_{v_{}}({\tau _{}})}{P_{v_{}}(\tau '_{})} = \frac{v_{\tau '_{}(1)}+\dots +v_{\tau '_{}(n)}}{v_{{\tau _{}}(1)}+\dots +v_{{\tau _{}}(n)}}\cdot \frac{v_{\tau '_{}(2)}+\dots +v_{\tau '_{}(n)}}{v_{{\tau _{}}(2)}+\dots +v_{{\tau _{}}(n)}}\cdots \frac{v_{\tau '_{}(n-1)}+v_{\tau '_{}(n)}}{v_{{\tau _{}}(n-1)}+v_{{\tau _{}}(n)}}\cdot \frac{v_{\tau '_{}(n)}}{v_{{\tau _{}}(n)}} > 1 . \end{aligned}$$
(6)
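
As a quick sanity check of Eq. (6), the snippet below (illustrative, reusing the hypothetical `pl_probability` helper from the earlier sketch) compares both sides numerically:

```python
# Numeric check of Eq. (6) on hypothetical strengths and two rankings.
import math

v = [0.5, 0.1, 0.25, 0.15]
tau, tau_p = (0, 2, 3, 1), (0, 2, 1, 3)
lhs = pl_probability(tau, v) / pl_probability(tau_p, v)
rhs = math.prod(sum(v[j] for j in tau_p[m:]) / sum(v[j] for j in tau[m:])
                for m in range(len(v)))
print(abs(lhs - rhs) < 1e-12)   # True
```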

Assume for a moment that the strengths are precisely known, and that \({\tau _{}}\) and \(\tau '_{}\) only differ by a swap of two elements: \({\tau _{}}(k)=\tau '_{}(k)\) for all \(k\notin \{i,j\}\) where \(i\ne j\), and \({\tau _{}}(j)=\tau '_{}(i)\), \({\tau _{}}(i)=\tau '_{}(j)\). Assume, without loss of generality, that \(i<j\). Then, the product terms in Eq. (6) only differ in the ratios involving rank j but not rank i; using furthermore \({\tau _{}}(i)=\tau '_{}(j)\), we get

$$\begin{aligned} \frac{P_{v_{}}({\tau _{}})}{P_{v_{}}(\tau '_{})}&= \prod _{\begin{array}{c} k=1\\ k\notin \{i+1,\dots ,j\} \end{array}}^n \underbrace{\frac{\sum _{\ell =k}^nv_{\tau '_{}(\ell )}}{\sum _{\ell =k}^nv_{{\tau _{}}(\ell )}}}_{=1}\cdot \prod _{k=i+1}^j\frac{\sum _{\ell =k}^nv_{\tau '_{}(\ell )}}{\sum _{\ell =k}^nv_{{\tau _{}}(\ell )}} = \prod _{k=i+1}^j\frac{v_{{\tau _{}}(i)}+\sum _{\ell =k,\ell \ne j}^nv_{\tau '_{}(\ell )}}{v_{{\tau _{}}(j)}+\sum _{\ell =k,\ell \ne j}^nv_{{\tau _{}}(\ell )}}. \end{aligned}$$

In this last expression, let us introduce, for any \(k\in \{i+1,\dots ,j\}\), the sums of strengths \(C_k:=\sum _{\ell =k,\ell \ne j}^nv_{{\tau _{}}(\ell )}=\sum _{\ell =k,\ell \ne j}^nv_{\tau '_{}(\ell )}\): these terms being positive, it follows that

$$\begin{aligned} \frac{P_{v_{}}({\tau _{}})}{P_{v_{}}(\tau '_{})}=\prod _{k=i+1}^j\frac{v_{{\tau _{}}(i)}+C_k}{v_{{\tau _{}}(j)}+C_k}>1\quad \text {whenever } v_{{\tau _{}}(i)}>v_{{\tau _{}}(j)}. \end{aligned}$$

In the case of imprecisely known strengths, the latter inequality will hold whenever the following (sufficient, but not necessary) condition is met:

$$\begin{aligned} \underline{v}_{{\tau _{}}(i)}:=\inf _{v_{}\in \Theta }v_{{\tau _{}}(i)} > \overline{v}_{{\tau _{}}(j)}:=\sup _{v_{}\in \Theta }v_{{\tau _{}}(j)}. \end{aligned}$$

Now comes a crucial insight. Assume a ranking \({\tau _{}}\) prefers \({\lambda _{\ell }}\) to \({\lambda _{k}}\) whereas \(\underline{v}_{k}>\overline{v}_{\ell }\), for some \(k \ne \ell \): then, we can find a “better” ranking \(\tau '_{}\) (i.e., one which dominates \({\tau _{}}\) in the sense of Eq. (3)) by swapping labels \({\lambda _{\ell }}\) and \({\lambda _{k}}\). In other terms, as soon as \(\underline{v}_{k} > \overline{v}_{\ell }\), all maximal rankings satisfy \({\lambda _{k}} \succ {\lambda _{\ell }}\).

It follows that, given an IPL model with strength set \(\Theta \), we can deduce a partial order on the labels from pairwise comparisons of the strength bounds: more precisely, we infer \({\lambda _{k}} \succ {\lambda _{\ell }}\) whenever \(\underline{v}_{k} > \overline{v}_{\ell }\). This partial order is easy to obtain; the rankings compatible with it may include some that are not optimal under the maximality criterion, but they are guaranteed to include all maximal rankings.
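
As an illustration, the sketch below (with hypothetical interval bounds, not those of the paper's examples) builds this partial order from the strength bounds:

```python
# Sketch of the outer approximation of Sect. 3.1: infer λ_k ≻ λ_l whenever the
# lower strength bound of λ_k exceeds the upper bound of λ_l (bounds are hypothetical).

def dominance_pairs(lower, upper):
    """Pairs (k, l) such that lower[k] > upper[l]; every maximal ranking
    must then place λ_k before λ_l."""
    n = len(lower)
    return {(k, l) for k in range(n) for l in range(n)
            if k != l and lower[k] > upper[l]}

lower = [0.40, 0.05, 0.10, 0.05]       # hypothetical lower bounds
upper = [0.60, 0.15, 0.30, 0.20]       # hypothetical upper bounds
print(dominance_pairs(lower, upper))   # e.g. {(0, 1), (0, 2), (0, 3)}
```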

3.2 Interval-Valued Case

We assume here that the strengths are interval-valued, \(v_{k}\in [\underline{v}_{k},\overline{v}_{k}]\) for every label \({\lambda _{k}}\); that is, the set \(\Theta \) of possible strengths (called credal set hereafter) is defined by

$$\begin{aligned} \Theta =\Big \{v_{}: \sum _{k=1}^nv_{k}=1,\ \underline{v}_{k}\le v_{k}\le \overline{v}_{k}\ \text {for all }k\Big \}. \end{aligned}$$
(7)

Note that we assume \(\underline{v}_{k}>0\) for each label \({\lambda _{k}}\): each object has a strictly positive lower probability of being ranked first. It follows that \(\overline{v}_{k}<1\), and thus that \(\Theta \) lies in the interior of the simplex. Such interval-valued strengths fall within the category of probability intervals on singletons [1, Sect. 4.4], and are coherent (nonempty and convex) iff [10]:

$$\begin{aligned} \sum _{\ell \ne k}\underline{v}_{\ell }+\overline{v}_{k}\le 1 \quad \text {and}\quad \sum _{\ell \ne k}\overline{v}_{\ell }+\underline{v}_{k}\ge 1 \qquad \text {for all }k. \end{aligned}$$
(8)

From now on, we will assume this condition to hold, and thus that \(\Theta \) is coherent.
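
For illustration, here is a small sketch checking condition (8), as stated above, on hypothetical interval bounds:

```python
# Sketch: checking the coherence condition (8) for interval-valued strengths
# (hypothetical bounds).

def is_coherent(lower, upper):
    n = len(lower)
    for k in range(n):
        rest_low = sum(lower[l] for l in range(n) if l != k)
        rest_up = sum(upper[l] for l in range(n) if l != k)
        # each bound must be attainable while the other strengths stay in their intervals
        if rest_low + upper[k] > 1 or rest_up + lower[k] < 1:
            return False
    return True

print(is_coherent([0.40, 0.05, 0.10, 0.05], [0.60, 0.15, 0.30, 0.20]))   # True
```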

We are interested in computing the set of E-admissible rankings, i.e. rankings \({\tau _{}}\) such that there exists \(v_{}\in \Theta \) for which \({\tau _{}}\) maximises \(P_{v_{}}\) (see Sect. 2). Our approach relies on two propositions, the proofs of which are omitted for lack of space.

Checking E-admissibility. We provide here an efficient way of checking whether a ranking \({\tau _{}}\) is E-admissible. According to Eq. (2), this is the case iff some \(v_{}\in \Theta \) is decreasingly ordered with respect to \({\tau _{}}\), i.e. \(v_{{\tau _{}}(1)} \ge v_{{\tau _{}}(2)} \ge v_{{\tau _{}}(3)} \ge \dots \)

Proposition 1

Consider any interval-valued parametrisation of an IPL model as defined by Eq. (7), and any ranking \({\tau _{}}\) in \({\mathcal {L}}\). Then, \({\tau _{}}\) is E-admissible (i.e., \({\tau _{}}\in \mathcal {E}_\Theta \)) iff there exists an index \(k\) such that

(9)

and

(10)

Checking E-admissibility via Proposition 1 has a polynomial complexity in the number \(n\) of labels. Indeed, we need to check \(n\) different values of \(k\): for each one, Eq. (9) requires computing a sum of \(n-1\) terms, and Eq. (10) requires checking \(n-1\) inequalities, which yields a complexity of \(2n(n-1)\) elementary operations.
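
Since Eqs. (9) and (10) are not reproduced here, the sketch below checks E-admissibility in a more generic (and heavier) way, as a linear feasibility problem over the credal set (7); it assumes SciPy is available, and the bounds are hypothetical:

```python
# Illustrative alternative to Proposition 1: a ranking is E-admissible iff some
# strength vector in the credal set (7) is decreasingly ordered along it, which
# can be checked as a linear feasibility problem.
import numpy as np
from scipy.optimize import linprog

def is_e_admissible(tau, lower, upper):
    """True iff some v with lower <= v <= upper and sum(v) = 1 satisfies
    v[tau[0]] >= v[tau[1]] >= ... >= v[tau[-1]]."""
    n = len(lower)
    A_ub = np.zeros((n - 1, n))
    for m in range(n - 1):                 # constraint: v[tau[m+1]] - v[tau[m]] <= 0
        A_ub[m, tau[m + 1]] = 1.0
        A_ub[m, tau[m]] = -1.0
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=np.zeros(n - 1),
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=list(zip(lower, upper)))
    return res.status == 0                 # status 0: a feasible point was found

lower = [0.40, 0.05, 0.10, 0.05]           # hypothetical bounds
upper = [0.60, 0.15, 0.30, 0.20]
print(is_e_admissible((0, 2, 3, 1), lower, upper))   # True
print(is_e_admissible((1, 0, 2, 3), lower, upper))   # False: needs v_1 >= v_0
```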

Computing the Set of E-admissible Rankings. Although Proposition 1 opens the way to finding the set of E-admissible rankings, there are \(n!\) candidate rankings: checking all of them is intractable.

We propose to address this issue by considering a search tree, in which a node is associated with a specific sequence of labels. Each subsequent node adds a new element to this sequence: a leaf is reached when the sequence corresponds to a complete ranking. By navigating the tree top-down, we may progressively check whether a sequence corresponds to the beginning of an E-admissible ranking. Should it not, all completions of the sequence can be ignored.

This requires a way of checking whether a sequence \(\kappa =(k_1, k_2, \dots , k_m)\), by essence incomplete, may be completed into an E-admissible ranking—i.e., whether we can find \({\tau _{}}\in \mathcal {E}_\Theta \) such that \({\tau _{}}(1)=k_1, {\tau _{}}(2)=k_2, \dots , {\tau _{}}(m)=k_m\). Proposition 2 provides a set of necessary and sufficient conditions to this end.

Proposition 2

Consider any coherent parametrisation of an IPL model as defined by Eq. (7), and a sequence of distinct labels \(\kappa =(k_1,\dots ,k_m)\) of length \(m \le n-1\). Then, there exists an E-admissible ranking beginning with this initial sequence iff the following equations are satisfied for every \(j\in \{1,\dots ,m\}\):

here, \(\kappa _j\) (\(j=0,\dots ,m\)) is the sub-sequence of the \(j\) first labels in \(\kappa \) (by convention, \(\kappa _0\) is empty), and its complement is the set of labels not appearing in \(\kappa _j\).

In the special case of \(m=1\), which is typically the case at depth one in the search tree, Eqs. (\(A_j\)), (\(B_j\)) and (\(C_j\)) reduce to:

Note that under the coherence requirement (8), Eq. () is a direct consequence of Eq. (\(B_1\)), but it is not the case for Eq. () when \(j\ge 2\).

Fig. 1. Probability intervals for Example 1

Fig. 2. Search tree for \(n=4\)

Example 1

Consider an IPL model defined by the strength intervals displayed in Fig. 1 (the coherence of which can be checked using Eq. (8)).

Consider the tree in Fig. 2, which will help navigate the set of possible rankings with \(n=4\) labels. The left-most node at depth \(m=1\) corresponds to the sequence \(({\lambda _{1}})\); its left-most child (left-most node at depth \(m=2\)) to the sequence \(({\lambda _{1}},{\lambda _{2}})\). We can see that this sequence has been ruled out as a possible initial segment for an E-admissible ranking: no further completion (i.e., neither of the two rankings \(({\lambda _{1}},{\lambda _{2}},{\lambda _{3}},{\lambda _{4}})\) and \(({\lambda _{1}},{\lambda _{2}},{\lambda _{4}},{\lambda _{3}})\)) will be checked.

The sequence \(({\lambda _{1}},{\lambda _{3}},{\lambda _{2}})\) has been ruled out as well; however, the sequence \(({\lambda _{1}},{\lambda _{3}},{\lambda _{4}})\) is valid, and can be straightforwardly completed into an E-admissible ranking (since only one possible label remains). In the end, the E-admissible rankings \({\tau _{}}=({\tau _{}}(1),{\tau _{}}(2),{\tau _{}}(3),{\tau _{}}(4))\) corresponding to the IPL model are \((1,3,4,2)\), \((1,4,2,3)\), \((1,4,3,2)\) and \((4,1,3,2)\).

A possible strength vector for which \({\tau _{}}=(1,3,4,2)\) dominates all others is given by \(v_{}=(\nicefrac 58,\nicefrac 1{12},\nicefrac 16,\nicefrac 18)\): it can easily be checked that \(v_{}\in \Theta \) and that \(v_{{\tau _{}}(1)}=\nicefrac 58\ge v_{{\tau _{}}(2)}=\nicefrac 16\ge v_{{\tau _{}}(3)}=\nicefrac 18\ge v_{{\tau _{}}(4)}=\nicefrac 1{12}\), i.e. \({\tau _{}}\) is E-admissible according to Eq. (2). We provide below possible strength vectors for each of the E-admissible rankings associated with the IPL model considered:

Admissible strength vector \(v_{}=(v_{1},v_{2},v_{3},v_{4})\in \Theta \) | Corresponding ranking \({\tau _{}}=({\tau _{}}(1),{\tau _{}}(2),{\tau _{}}(3),{\tau _{}}(4))\in \mathcal {E}_\Theta \)
\((\nicefrac 58,\nicefrac 1{12},\nicefrac 16,\nicefrac 18)\) | (1, 3, 4, 2)
\((\nicefrac 58,\nicefrac 1{12},\nicefrac 1{12},\nicefrac 5{24})\) | (1, 4, 2, 3)
\((\nicefrac 58,\nicefrac 1{12},\nicefrac 1{12},\nicefrac 5{24})\) | (1, 4, 3, 2)
\((\nicefrac 38,\nicefrac 1{12},\nicefrac 16,\nicefrac 38)\) | (4, 1, 3, 2)

Let us show that there is no E-admissible ranking \({\tau _{}}\) that starts, for instance, with (1, 2). Assume ex absurdo that such an E-admissible ranking \({\tau _{}}\) exists. This would imply that there exists \(v_{}\in \Theta \) such that \({\tau _{}}={\tau _{}}_{v_{}}^\star \), which by Eq. (2) would imply that \(\nicefrac 1{12}=\overline{v}_{2}\ge v_{2}\ge v_{4}\ge \underline{v}_{4}=\nicefrac 18\), which is impossible. \(\lozenge \)

Algorithm. Eqs. (\(A_j\)), (\(B_j\)) and (\(C_j\)), used in Proposition 2 to check whether a given initial sequence of labels can be completed into an E-admissible ranking, can be turned into an efficient algorithm. We can indeed proceed recursively: checking whether there exists an E-admissible ranking starting with \((k_1,\dots ,k_m)\) essentially requires checking whether this is the case for \((k_1,\dots ,k_{m-1})\), and then whether Eqs. (\(A_j\)), (\(B_j\)) and (\(C_j\)) still hold for \(j=m\).

Algorithms 1 and 2 provide pseudo-code for this procedure. Note that, as with all branch-and-bound techniques, it does not reduce the worst-case complexity of building the E-admissible set. Indeed, if all rankings are E-admissible (which typically happens when all probability intervals are wide), then no branch can be pruned from the search tree. In that case, the algorithm explores the complete tree, which has a factorial complexity in the number of labels \(n\); but then, even simply enumerating all E-admissible rankings has such a complexity.

However, in practice we can expect many branches of the tree to be pruned quickly: as soon as one of Eqs. (\(A_j\)), (\(B_j\)) or (\(C_j\)) fails to hold, the corresponding branch can be pruned from the tree. We expect this to allow for efficient inferences in many circumstances.

Algorithms 1 and 2 (pseudo-code omitted)
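
Algorithms 1 and 2 themselves are not reproduced above; as a rough illustration of the same depth-first strategy, the sketch below enumerates the E-admissible set, but prunes with a generic LP feasibility test (a prefix \((k_1,\dots ,k_m)\) can be completed into an E-admissible ranking iff some \(v_{}\in \Theta \) satisfies \(v_{k_1}\ge \dots \ge v_{k_m}\ge v_{\ell }\) for every unused label \(\ell \)) rather than with the closed-form Eqs. (\(A_j\))–(\(C_j\)); it assumes SciPy is available and uses hypothetical bounds.

```python
# Illustrative depth-first enumeration of E-admissible rankings with LP-based
# pruning (not the closed-form tests of Proposition 2).
import numpy as np
from scipy.optimize import linprog

def prefix_is_extendable(prefix, lower, upper):
    """True iff some v in the credal set (7) satisfies
    v[prefix[0]] >= ... >= v[prefix[-1]] >= v[l] for every unused label l."""
    n = len(lower)
    rows = [(b, a) for a, b in zip(prefix, prefix[1:])]           # v[b] - v[a] <= 0
    rows += [(l, prefix[-1]) for l in range(n) if l not in prefix]
    A_ub = np.zeros((len(rows), n))
    for i, (pos, neg) in enumerate(rows):
        A_ub[i, pos], A_ub[i, neg] = 1.0, -1.0
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=np.zeros(len(rows)),
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=list(zip(lower, upper)))
    return res.status == 0

def e_admissible_rankings(lower, upper):
    n, out = len(lower), []
    def expand(prefix):
        if len(prefix) == n:
            out.append(tuple(prefix))
            return
        for l in range(n):
            if l not in prefix and prefix_is_extendable(prefix + [l], lower, upper):
                expand(prefix + [l])          # descend; otherwise the branch is pruned
    expand([])
    return out

lower = [0.40, 0.05, 0.10, 0.05]   # hypothetical bounds: label 0 is always ranked first
upper = [0.60, 0.15, 0.30, 0.20]
print(e_admissible_rankings(lower, upper))
```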

4 An Application to Label Ranking

In this section, we explore an application of the IPL model to the supervised learning of label rankings. Usually, supervised learning consists in mapping any instance \({\mathbf {x}}\in {\mathcal {X}}\) to a single (preferred) label of \({\Lambda }=\{{\lambda _{1}},\ldots ,{\lambda _{n}}\}\) representing its class. Here, we study a more complex problem called label ranking, which rather maps \({\mathbf {x}}\in {\mathcal {X}}\) to a predicted total order \(\hat{y}\) on the labels in \({\Lambda }\)—or a partial order, should we accept to make imprecise predictions for the sake of robustness.

For this purpose, we exploit a set of training instances associated with rankings \(({\mathbf {x}}_i,{\tau _{i}})\), \(i=1,\dots ,m\), in order to estimate the conditional probability measure associated with a new instance \({\mathbf {x}}\). Ideally, the observed outputs \({\tau _{i}}\) should be complete orders over \({\Lambda }\); however, this is seldom the case, total orders being more difficult to observe: training instances are therefore frequently associated with incomplete rankings \({\tau _{i}}\) (i.e., partial orders over \({\Lambda }\)).

Here, we will apply the approach detailed in Sect. 3.1 to learning an IPL model from such training data, using the contour likelihood to get the parameter set corresponding to a specific instance \({\mathbf {x}}\).

4.1 Estimation and Prediction

Precise Predictions. In [7], it was proposed to use an instance-based approach: the prediction for any instance \({\mathbf {x}}\) is made locally, using its nearest neighbours.

Let \(\mathcal {N}_K({\mathbf {x}})\) stand for the set of \(K\) nearest neighbours of \({\mathbf {x}}\) in the training set, each neighbour being associated with a (possibly incomplete) ranking \({\tau _{i}}\), and let \(M_i\) be the number of labels ranked in \({\tau _{i}}\). Using the classical instance-based assumption that distributions are locally identical (i.e., in the neighbourhood of \({\mathbf {x}}\)), the probability of observing \({\tau _{1}}, \ldots , {\tau _{K}}\) given a parameter value \(v_{}\) is:

$$\begin{aligned} P({\tau _{1}}, \ldots , {\tau _{K}} | v_{}) = \prod _{{\mathbf {x}}_i\in \mathcal {N}_K({\mathbf {x}})} \prod ^{M_i}_{m=1} \frac{v_{{\tau _{i}} (m)}}{\sum _{j=m}^{M_i} v_{{\tau _{i}} (j)}}. \end{aligned}$$
(11)

We can then use maximum likelihood estimation (MLE) in order to determine \(v_{}\) from \({\tau _{1}}, \ldots , {\tau _{K}}\), by maximizing (11)—or equivalently, its logarithm

$$\begin{aligned} l(v_{}) = \sum ^K_{i=1} \sum ^{M_i}_{m=1} \left[ \log ({v_{{\tau _{i}} (m)}}) - \log {\sum _{j=m}^{M_i}v_{{\tau _{i}} (j)}} \right] . \end{aligned}$$

Various ways to obtain this maximum have been investigated. We will use here the minorization-maximization (MM) algorithm [16], which aims, in each iteration, to maximize a function which minorizes the log-likelihood:

$$\begin{aligned} Q_k(v_{}) = \sum ^K_{i=1} \sum ^{M_i}_{m=1} \left[ \log ({v_{{\tau _{i}} (m)}}) - \frac{\sum _{j=m}^{M_i}v_{{\tau _{i}} (j)}}{\sum _{j=m}^{M_i}v_{{\tau _{i}} (j)}^{(k)}} \right] \end{aligned}$$

where \(v_{}^{(k)}\) is the estimate of \(v_{}\) at the \(k\)-th iteration. Given the current estimate \(v_{}^{(k)}\), the maximization of \(Q_k\) can be solved analytically, and the algorithm provably converges to the MLE estimate \(v_{}^*\) of \(v_{}\). The best ranking \({\tau _{}}^*\) is then

$$\begin{aligned} {\tau _{}}^* = {\tau _{}}_{v_{}^*}^\star ; \end{aligned}$$

it is simply obtained by ordering the labels according to \(v_{}^*\) (see Eq. (2)).
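
For illustration, here is a compact sketch of an MM iteration for the local likelihood (11), in the spirit of [16]; the neighbourhood rankings below are hypothetical, and every label is assumed to appear in at least one ranking:

```python
# Illustrative MM fit of PL strengths from (possibly partial) rankings, each
# given as a tuple of label indices, best first.
import numpy as np

def mm_fit(rankings, n_labels, n_iter=100):
    v = np.full(n_labels, 1.0 / n_labels)       # uniform initial strengths
    for _ in range(n_iter):
        wins = np.zeros(n_labels)               # number of rankings containing each label
        denom = np.zeros(n_labels)
        for tau in rankings:
            for m in range(len(tau)):
                wins[tau[m]] += 1.0
                d = sum(v[j] for j in tau[m:])  # strengths of labels still in the race
                for j in tau[m:]:               # each remaining label pays 1/d
                    denom[j] += 1.0 / d
        v = wins / denom                        # closed-form maximiser of the surrogate
        v /= v.sum()                            # renormalise onto the simplex
    return v

# hypothetical neighbourhood of K = 4 (partial) rankings over n = 4 labels
rankings = [(0, 2, 3, 1), (1, 0, 3), (2, 0, 1), (3, 1, 2)]
v_star = mm_fit(rankings, n_labels=4)
print(v_star, np.argsort(-v_star))              # estimated strengths and best ranking
```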

Imprecise Predictions. An IPL model is in one-to-one correspondence with an imprecise parameter estimate, which can be obtained here by extending the classical likelihood through the contour likelihood method [5]. Given a parameter space \(\Sigma \) and a positive likelihood function L, the contour likelihood function is:

$$\begin{aligned} L^*(v_{}) = \frac{L(v_{})}{\max _{v_{}\in \Sigma } L(v_{})} ; \end{aligned}$$

by definition, \(L^*\) takes values in ]0, 1]: the closer \(L^*(v_{})\) is to 1, the more likely \(v_{}\) is. One can then naturally obtain imprecise estimates by considering “cuts”. Given \(\beta \) in [0, 1], the \(\beta \)-cut of the contour likelihood, written \(B^*_\beta \), is defined by

$$\begin{aligned} B^*_\beta =\{v_{}\in \Sigma : L^*(v_{})\ge \beta \}. \end{aligned}$$

Once \(B^*_\beta \) is determined, for any test instance \({\mathbf {x}}\) to be processed, we can easily obtain an imprecise prediction \(\hat{y}\) in the form of a partial ranking, using the results of Sect. 3.1: we retrieve \(\hat{y}\) such that \({\lambda _{i}} \succ {\lambda _{j}}\) whenever \(\inf _{v_{}\in B^*_\beta }v_{i} > \sup _{v_{}\in B^*_\beta }v_{j}\). We stress that the choice of \(\beta \) directly influences the precision (and thus the robustness) of the model: \(B^*_1 = \{v_{}^*\}\), which generally leads to a precise PL model; as \(\beta \) decreases, the IPL model becomes less and less precise, possibly leading to partial (and even empty) predictions.

In our experiments, the contour likelihood function is approximated by generating multiple strength vectors \(v_{}\) according to a Dirichlet distribution with parameter \(\gamma v_{}^*\), where \(v_{}^*\) is the ML estimate obtained for the best (precise) PL model and \(\gamma > 0\) is a coefficient which makes it possible to control the concentration of the generated parameters around \(v_{}^*\).
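
A sketch of this sampling procedure is given below; it is illustrative, it reuses the hypothetical `mm_fit` helper and `rankings` from the previous sketch, and the cut level and concentration are arbitrary choices:

```python
# Illustrative construction of an (approximate) beta-cut of the contour
# likelihood by Dirichlet sampling around the ML estimate v_star.
import numpy as np

def likelihood(rankings, v):
    """Local likelihood of Eq. (11) for possibly partial rankings."""
    p = 1.0
    for tau in rankings:
        for m in range(len(tau)):
            p *= v[tau[m]] / sum(v[j] for j in tau[m:])
    return p

def beta_cut(rankings, v_star, beta=0.8, gamma=10.0, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    samples = rng.dirichlet(gamma * v_star, size=n_samples)
    l_max = likelihood(rankings, v_star)             # reference value for L*
    return [v for v in samples if likelihood(rankings, v) / l_max >= beta]

def partial_order(cut):
    """Outer approximation of Sect. 3.1 applied to the sampled cut:
    predict λ_i before λ_j whenever min v_i > max v_j over the cut."""
    if not cut:                                      # empty cut: abstain on every pair
        return set()
    arr = np.array(cut)
    low, up = arr.min(axis=0), arr.max(axis=0)
    n = arr.shape[1]
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and low[i] > up[j]}

# usage, with `rankings` and `v_star` taken from the previous sketch:
cut = beta_cut(rankings, v_star, beta=0.8, gamma=10.0)
print(partial_order(cut))
```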

4.2 Evaluation

When the observed and predicted rankings y and \(\hat{y}\) are complete, various accuracy measures [15] have been proposed to measure how close they are to each other (0/1 accuracy, Spearman's rank correlation, ...). Here, we retain Kendall's Tau:

$$\begin{aligned} A_{\tau }(y,\hat{y})=\frac{C-D}{\nicefrac {n(n-1)}{2}} , \end{aligned}$$
(12)

where C and D are respectively the numbers of concordant and discordant pairs in y and \(\hat{y}\). In the case of imprecise predictions \(\hat{y}\), the usual quality measures can be decomposed into two components [9]: correctness (CR), measuring the accuracy of the predicted comparisons, and completeness (CP):

$$\begin{aligned} CR(y,\hat{y})=\frac{C-D}{C+D} \quad \text {and} \quad CP(y,\hat{y})=\frac{C+D}{\nicefrac {n(n-1)}{2}},\end{aligned}$$
(13)

where C and D are the same as in Eq. (12). Should \(\hat{y}\) be complete, \(C+D=\nicefrac {n(n-1)}{2}\), so that \(CR(y,\hat{y})=A_{\tau }(y,\hat{y})\) and \(CP(y,\hat{y})=1\); conversely, \(CR(y,\hat{y})=1\) (by convention) and \(CP(y,\hat{y})=0\) if \(\hat{y}\) is empty, since no comparison is made.
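
For completeness, here is a small sketch computing CR and CP from Eq. (13), where the (possibly partial) prediction is represented as a set of pairwise preferences; the example values are hypothetical:

```python
# Illustrative computation of correctness (CR) and completeness (CP), Eq. (13).
# `y` is the true complete ranking (best label first); `predicted_pairs` contains
# pairs (i, j) meaning "label i is predicted before label j".

def cr_cp(y, predicted_pairs):
    pos = {label: rank for rank, label in enumerate(y)}   # position in the true ranking
    n = len(y)
    C = sum(1 for (i, j) in predicted_pairs if pos[i] < pos[j])   # concordant pairs
    D = sum(1 for (i, j) in predicted_pairs if pos[i] > pos[j])   # discordant pairs
    cr = (C - D) / (C + D) if C + D else 1.0   # empty prediction: CR = 1 by convention
    cp = (C + D) / (n * (n - 1) / 2)
    return cr, cp

y = (0, 2, 3, 1)                            # true ranking
pred = {(0, 1), (0, 2), (0, 3), (3, 2)}     # hypothetical partial prediction
print(cr_cp(y, pred))                       # (0.5, 0.666...)
```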

4.3 Results

We performed our experiments on several data sets, mostly adapted from the classification setting [7]; we report here the results obtained on the Bodyfat, Housing and Wisconsin data sets. For each data set, we tested several numbers of neighbours, \(K \in \{5, 10, 15, 20\}\) (for the MLE estimate, using Eq. (12)), and chose the best one by tenfold cross-validation. The sets of parameters \(B^*_\beta \) were obtained as explained above, by generating 200 strength vectors with \(\gamma \in \{1,10\}\), the best value being selected via tenfold cross-validation repeated 3 times.

We also compared our approach to another proposal [8] based on a rejection threshold of pairwise preference probabilities, in three different configurations:

  • using the original, unperturbed rankings;

  • by deleting some labels in the original rankings with a probability \(p \in [0,1]\);

  • by introducing noise in the rankings: adjacent labels, chosen at random, are swapped with a probability \(p \in [0,1]\).

Figure 3 displays the results of both methods on the Bodyfat data set (\(m=252\), \(n=7\)) when rankings are left unperturbed, with \(95\%\) confidence intervals (\(\pm 2\) standard deviations of the measured correctness). Our approach based on the contour likelihood function is on par with the abstention-based method, and this was the case for all tested data sets. Both methods see correctness increase once we allow for abstention. On the other data sets, the same behaviour can be seen: our approach is on par with the abstention-based one, provided that the contour likelihood function has been correctly modelled (i.e., the generation of strengths is appropriate).

Fig. 3. Comparison of methods on Bodyfat with no perturbations

In order to compare the two methods more easily, the figures below focus on a specific range of completeness, namely [0.6, 1]; the behaviour is similar outside this range.

Figures 4 and 5 show that both methods are also on par on the Housing data set (\(m=506\), \(n=6\)), even when some labels are missing from the rankings. It can also be noticed that, for a given completeness level, the correctness is lower than in the unperturbed case: on average, the greater the level of perturbation, the lower the correctness. The same holds for the other data sets.

Fig. 4. Comparison of methods on Housing with no perturbations

Fig. 5. Comparison of methods on Housing with 60% of missing label pairs

Figures 6 and 7 show that, with a different perturbation method (label swapping), our approach gives similar results on the Wisconsin data set (\(m=194\), \(n=16\)). Moreover, for a given completeness level, the correctness is again lower on average when the data set is perturbed. We observe the same behaviour for the label-swapping perturbation on the other data sets.

Fig. 6. Comparison of methods on Wisconsin with no perturbations

Fig. 7. Comparison of methods on Wisconsin with 60% of swapped label pairs

Such results are encouraging, as they show that we can at least achieve results similar to state-of-the-art approaches. It remains to identify the cases in which the two approaches significantly differ.

5 Conclusions

In this paper, we made a preliminary investigation into performing robust inference and making cautious predictions with the well-known Plackett–Luce model, a popular ranking model in statistics. We have provided efficient methods to do so when the data at hand are poor, that is, either of low quality (noisy, partial) or scarce. We have demonstrated the interest of our approach on a label ranking problem, in the presence of missing or noisy ranking information.

Possible future investigations may focus on the estimation problem, which could be improved, for example, by extending Bayesian approaches [14] through the consideration of sets of prior distributions, or by developing a natively imprecise likelihood estimate, for instance by coupling recent estimation algorithms based on the stationary distributions of Markov chains [20] with recent work on imprecise Markov chains [17].

As suggested by an anonymous reviewer, it might be interesting to consider alternative estimation methods such as epsilon-contamination. There already exist non-parametric, decomposition-based approaches to label ranking with imprecise ranks; see [4, 11]. However, the PL model, being tied to an order representation, may not be well suited to such an approach. We intend to investigate this in the future.

Last, since the Plackett–Luce model is known to be strongly linked to particular random utility (RUM) models [2, 24], it may be interesting to investigate what becomes of this connection when the RUM model is imprecise (for instance, in our case, by considering Gumbel distributions specified with imprecise parameters).