1 Introduction

Many everyday applications, such as meta-search, online product recommendation and ordinal peer grading, involve the aggregation of k-ary preferences (Liu 2009; Volkovs et al. 2012; Raman and Joachims 2014). Namely, multiple k-ary preferences over objects are aggregated into an L-ary preference (\(k \ll L\)). Meanwhile, crowdsourcing has become a popular way to provide a plethora of k-ary preferences for this aggregation problem (Venanzi et al. 2014), for two reasons: first, it is convenient to collect preferences on crowdsourcing platforms; second, the cost of crowdsourced preferences is very low.

However, k-ary preferences from crowdsourced workers are often noisy (Sheng et al. 2008; Snow et al. 2008; Ok et al. 2016), which adversely affects the robustness of traditional aggregation models (i.e., permutation-based or score-based models). Although the state-of-the-art CrowdBT (Chen et al. 2013) aggregates pairwise crowdsourced preferences effectively, it cannot model the characteristics of k-ary preferences seamlessly. Specifically, CrowdBT splits each k-ary preference into multiple pairwise preferences before the aggregation, which may lead to cyclical inconsistency (\(a > b\) and \(b > a\), or \(a > b\), \(b > c\) and \(c > a\)).Footnote 1 In addition, traditional inference methods make aggregation models impractical for large-scale preferences. For example, when the number of preferences is large (e.g., 500,000 preferences), the Gibbs sampling approach is too slow to infer the permutation-based model efficiently (Lu and Boutilier 2011). Therefore, for large-scale k-ary crowdsourced preferences, can we build a robust aggregation model with an efficient inference under the single ground-truth assumption?Footnote 2

In this paper, we present a RObust PlAckett–Luce (ROPAL) model coupled with an efficient online inference, which is tailor-made for large-scale k-ary crowdsourced preferences. To ensure robustness, ROPAL integrates the traditional Plackett–Luce model with a denoising vector. Based on the Kendall-tau distance, this vector corrects k-ary crowdsourced preferences with a certain probability. Specifically, each denoising vector is a distribution that corresponds to a unique worker quality. For an expert, the first entry of her denoising vector, which denotes the probability that her preferences accord with the ground truth, is close to 1. That is to say, we trust all her preferences. For an amateur, since she only makes mistakes on comparable objects during the annotation process, the first two entries of her denoising vector dominate the distribution. Therefore, ROPAL corrects her preferences with the probability revealed by the first or second entry of her denoising vector. For a spammer, as she annotates preferences carelessly, each entry of her denoising vector is nearly \(\frac{1}{C_k^2+1}\). Therefore, the likelihood of the preferences from each spammer is a constant, which means ROPAL discards all her preferences. For a malicious worker, ROPAL learns her denoising vector by exploring the pattern in which she corrupts the preferences. For example, if a malicious worker annotates preferences in exactly the opposite order, the last entry of her denoising vector will be close to 1. To sum up, ROPAL reveals the ground truth of a worker's preferences according to the entries of her estimated denoising vector. To ensure the scalability of ROPAL, we adopt an online learning strategy for the Bayesian extension of ROPAL and resort to moment-matching methods to update its hyperparameters incrementally. All hyperparameters are updated by closed-form solutions, which makes ROPAL scalable to large-scale preferences. Our contributions are as follows.

  • We present a RObust PlAckett–Luce (ROPAL) model to aggregate k-ary preferences directly, which avoids the issue of cyclical inconsistency.

  • We introduce a denoising vector in ROPAL to deal with crowdsourced (noisy) preferences and detect noisy workers simultaneously. We design this vector based on Kendall-tau distance, which avoids the curse of permutation-space partition.

  • We propose an online Bayesian inference, which makes ROPAL scalable to large-scale preferences.

  • We conduct comprehensive experiments on “large-scale simulated” and “real-world” datasets. Empirical results demonstrate that ROPAL with the online Bayesian inference achieves substantial improvements in robustness and noisy worker detection over current approaches.

The remainder of this paper is organized as follows. In Sect. 2, the related work is presented. Section 3 is the core of this paper. Initially, we introduce the k-ary crowdsourced preferences setting. Then, for this setting, we propose a RObust PlAckett–Luce (ROPAL) model, which integrates the Plackett–Luce model (for k-ary preferences) with a denoising vector (for robustness). Lastly, we present an online Bayesian inference to make ROPAL scalable to large-scale preferences. In Sect. 4, we provide the experimental setup and empirical results for simulated and real-world experiments. Concluding remarks and future work are given in Sect. 5.

2 Related work

Our work is partially related to a novel ranking problem, namely, aggregating multiple k-ary preferences into a consensus L-ary preference, where \(k \ll L\). This ranking problem has a wide range of practical applications. For example, in ordinal peer grading (Raman and Joachims 2014), each grader returns a ranking over her assigned subset of assignments, and her ranking is regarded as ordinal feedback. All the ordinal feedback from different graders is aggregated to yield the final grades over all assignments. Meta-search, which aims to merge multiple partial rankings from different sources into a single full ranking, is another example (Volkovs et al. 2012). For online product recommendation (Lv et al. 2011), the system in eBay or Amazon recommends the most related products to each customer according to her browsing history. Namely, multiple ordinal records are aggregated into a single consensus recommendation list. Our work is also related to crowdsourcing (Deng et al. 2016; Krishna et al. 2017), which provides sufficient preferences for this ranking problem.

When aggregating k-ary crowdsourced preferences, two crucial challenges should be noted: preference noise and computational cost. To be specific, in crowdsourcing, online workers can be roughly categorized as domain experts and amateurs according to their performance. Due to the low reward, most crowdsourced preferences come from amateurs rather than domain experts. Therefore, crowdsourced preferences are often noisy (Kazai et al. 2011; Vuurens et al. 2011), which adversely affects the robustness of conventional aggregation models (i.e., score-based and permutation-based models). In addition, traditional inference methods, such as expectation-maximization (EM) and Gibbs sampling, need multiple iterations to converge (Bishop 2006). Therefore, traditional inference methods render aggregation models impractical for large-scale datasets in the real world. In the following, we list representative examples to explain these two challenges.

Among the score-based models, the Plackett–Luce model (Luce 1959; Plackett 1975) is suitable for aggregating k-ary preferences directly. However, k-ary crowdsourced preferences are full of noise, which is very likely to degrade the robustness of the Plackett–Luce model considerably. Although CrowdBT (Chen et al. 2013) extends the Bradley and Terry (1952) model to aggregate pairwise crowdsourced preferences robustly, it is unable to deal with k-ary crowdsourced preferences directly. Specifically, CrowdBT splits each k-ary preference into multiple pairwise preferences before the aggregation, which leads to the issue of cyclical inconsistency. Among permutation-based models, the Mallows (1957) model directly models the distribution of a full L-ary preference, but exact inference for this distribution tends to be time-consuming or even intractable. Recent work has extended the Mallows model to define distributions over multiple k-ary preferences (Lu and Boutilier 2011, 2014). However, when the number of k-ary preferences is large, the sampling approach leveraged by permutation-based models is typically slow, which makes these models impractical for large-scale preferences.

To sum up, our target in this paper is twofold. The first aspect is to propose a robust ranking model that aggregates k-ary crowdsourced preferences effectively. The second is to present an efficient inference that makes the proposed model scalable to large-scale preferences.

3 Aggregating with k-ary crowdsourced preferences

We begin this section with the k-ary crowdsourced preferences setting. Then, we introduce the Plackett–Luce (PL) model and the crowdsourced Bradley–Terry (CrowdBT) model, respectively. Meanwhile, we discuss their deficiencies in aggregating k-ary crowdsourced preferences. Furthermore, we present our RObust PlAckett–Luce (ROPAL) model, which avoids the above deficiencies. Specifically, ROPAL introduces a denoising vector to effectively deal with crowdsourced (noisy) preferences and detect noisy workers. Lastly, we propose an online Bayesian inference to make ROPAL scalable to large-scale preferences.

Table 1 Common mathematical notations
Fig. 1

Comparison between two settings. Upper panel: In the traditional crowdsourced setting, worker w annotates the whole set to yield a full or partial preference. The gray area denotes the 2 remaining objects \((o_1,o_5)\), which are assumed to be ranked lower. Lower panel: In the k-ary crowdsourced preferences setting, worker w annotates multiple subsets to yield several partial preferences. \(O_{n,w}\) denotes the nth k-ary preference annotated by worker w, and worker w provides \(N_w\) partial preferences in total. Note that: (1) the notation W in the corner means W workers complete the annotation process independently and repeatedly; (2) the subsets with “\(\surd \)” are selected by worker w

3.1 k-Ary crowdsourced preferences setting

In Table 1, we first list the common mathematical notations that are used later. Then we compare the traditional crowdsourced preference setting with the k-ary crowdsourced preferences setting in Fig. 1. Specifically, in the traditional preference setting, a finite set of objects \(\Omega = \{o_1,o_2,\ldots ,o_L\}\) is assigned to W workers. Each worker w ranks the whole set \(\Omega \) independently to yield only one preference \(O_w\) Footnote 3 (the full preference in the upper panel of Fig. 1) according to a certain criterion, such as personal preferences or attitudes.

However, the total number of objects L tends to be very large in many real-world applications (Shah et al. 2013; Luaces et al. 2015). Therefore, the annotation process cannot be carried out completely every time. Normally, worker w only annotates her most-confident l (\(< L\)) objects and leaves the remaining \(L - l\) objects undefined (the partial preference \(O_w\) in the upper panel of Fig. 1). Numerous researchers are working on this (partial) ranking aggregation problem (Dwork et al. 2001; Schalekamp and Zuylen 2009; Kolde et al. 2012). Most of this work tacitly assumes that the remaining \(L - l\) objects are ranked lower (Guiver and Snelson 2009; Mollica and Tardella 2017). Therefore, the partial preferences can be modelled by the classical Plackett–Luce model. However, the traditional (partial) ranking aggregation problem still faces the following problems: first, it is unrealistic to assume the remaining objects are less important, especially for the large-L ranking problem; second, many workers tend to annotate only their most-confident l (\(< L\)) objects to ensure their accuracy, which undoubtedly results in a lack of information in the collected preferences.

Ordinal peer grading, where students assess each other, is a promising approach to tackle the large-scale evaluation problem (Freeman and Parks 2010; Shah et al. 2013; Luaces et al. 2015; Kulkarni et al. 2015; Prpić et al. 2015). Motivated by ordinal peer grading, we propose our k-ary crowdsourced preferences setting, which overcomes the limitation when L is large (Alfaro and Shavlovsky 2014). Specifically, a large set of objects \(\Omega \) is broken into \(C_L^k\) distinct tasks \(\{\xi _i\}_{i=1}^{C_L^k}\) with \(|\xi _i|\equiv k\), where each task \(\xi _i\) is a subset of \(\Omega \). Then \(N_w\) of the \(C_L^k\) tasks are assigned randomly to each worker w to annotate, where \(\forall w\in \{1,2,\ldots ,W\}\) (the lower panel of Fig. 1). Taking the task difficulty into account, in real applications, the size of each task should be much smaller than the total number of objects (e.g., \(|\xi _i|\equiv k < 7 \ll |\Omega | = L\)). In the following, we aim to aggregate the k-ary preferences annotated by W workers into a full preference over all L objects.
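As a toy illustration of this setting (all sizes are hypothetical), the task-splitting and random-assignment step can be sketched as:

```python
import random
from itertools import combinations

# Toy sizes (hypothetical): L objects, k-ary tasks, W workers, N_w tasks each.
L, k, W, N_w = 10, 3, 4, 5
objects = list(range(L))

random.seed(0)
tasks = list(combinations(objects, k))        # the C_L^k distinct subsets xi_i
assignment = {w: random.sample(tasks, N_w)    # N_w random tasks per worker w
              for w in range(W)}

print(len(tasks))  # C_10^3 = 120 distinct 3-ary tasks
```

Each worker then annotates only her assigned \(N_w\) subsets, producing the k-ary preferences to be aggregated.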

3.2 Deficiency of classical models

Given a k-ary preference \(O{:}\,o_1>o_2> \cdots > o_k\),Footnote 4 it can be modeled by the classical Plackett–Luce model \(f_{PL}\) as follows:

$$\begin{aligned} f_{PL}(O|\mathbf {S})= \prod _{i=1}^{k-1} \frac{e^{S_{o_{i}}}}{ \sum _{t=i}^{k} e^{S _{o_{t}}}}, \end{aligned}$$
(1)

where \(\mathbf {S}= \{S_1,S_2, \ldots ,S_L\}\) denotes implicit scores associated with each object. Based on Eq. (1), for the collection of all k-ary crowdsourced preferences D, the likelihood function \(\Pr (D|\mathbf {S})\) can be represented as:

$$\begin{aligned} \Pr (D|\mathbf {S}) = \prod _{w=1}^W \Pr (D_w|\mathbf {S}) = \prod _{w=1}^W \prod _{n=1}^{N_w} f_{PL}( O _{n,w}|\mathbf {S}), \end{aligned}$$
(2)

where \(D_w\) is the collection of multiple k-ary preferences annotated by worker w, and \(O _{n,w}\) is the nth preference annotated by worker w. More intuitive explanations can be found in the lower panel of Fig. 1. Note that the traditional Plackett–Luce model [Eq. (2)] does not consider worker quality. Therefore, this model treats preferences from different workers equally. However, when preferences are derived from crowdsourcing platforms, the preferences annotated by amateurs are often noisy, which inevitably degrades the aggregation performance of the traditional Plackett–Luce model.
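To make Eq. (1) concrete, here is a minimal sketch in Python (the scores are hypothetical toy values):

```python
import math

def f_pl(pref, scores):
    """Plackett-Luce likelihood of a k-ary preference [Eq. (1)].

    pref   : objects ordered best-first, e.g. [0, 1, 2]
    scores : mapping object -> implicit score S_o
    """
    prob = 1.0
    for i in range(len(pref) - 1):
        # Each factor: winner's exp-score over the exp-scores still in play.
        prob *= math.exp(scores[pref[i]]) / sum(
            math.exp(scores[o]) for o in pref[i:])
    return prob

# Hypothetical scores: object 0 is best, object 2 is worst.
S = {0: 1.0, 1: 0.0, 2: -1.0}
```

Since the factors telescope over the remaining objects, the likelihoods of all \(k!\) orderings of a subset sum to one, so \(f_{PL}\) defines a proper distribution over orderings.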

Chen et al. (2013) proposed a pairwise model (CrowdBT) to aggregate pairwise crowdsourced preferences robustly, since CrowdBT considers the worker quality. However, pairwise models handle k-ary preferences clumsily for two reasons. First, pairwise models cannot model the characteristics of k-ary preferences seamlessly. Second, pairwise models split each k-ary preference into multiple pairwise preferences before the aggregation, which may lead to the second type of cyclical inconsistency. Assume that worker w has a unique worker quality \(\eta _w\), namely the probability that the pairwise preference \(o_a >_{w} o_b\) annotated by worker w accords with the true preference \(o_a >o_b\): \(\eta _w = \Pr (o_a>_{w} o_b \mid o_a >o_b)\). Therefore, according to the law of total probability, we have:

$$\begin{aligned} \begin{aligned} \Pr (o_a>_{w} o_b)&=\Pr (o_a>_{w} o_b \mid o_a>o_b)\Pr (o_a>o_b)\\&\quad +\,\Pr (o_a>_{w} o_b \mid o_a<o_b)\Pr (o_a<o_b)\\&=\eta _{w}\Pr (o_a>o_b) + (1-\eta _{w})\Pr (o_a<o_b). \end{aligned} \end{aligned}$$

Assume that we have a k-ary subset \((o_1,o_2, \ldots , o_{k})\), which is annotated as \(O_{n,w}{:}\,o_1>_{w} o_2>_{w} \cdots > _{w} o_{k}\) by worker w. A pairwise model will first split this k-ary preference into \(C_k^2\) pairwise preferences \(o_i>_{w}o_j\), where \(i < j\) and \(i,j \in \{1,2,\ldots ,k\}\). Then, the model aggregates these pairwise preferences using the worker quality \(\eta _w\). Here, we take \(k=3\) as an example. The 3-ary preference can be modeled as:

$$\begin{aligned} \begin{aligned} \Pr \left( o_a>_{w} o_b>_{w} o_c\right)&=\Pr (o_a>_{w} o_b)\Pr (o_a>_{w} o_c)\Pr (o_b>_{w} o_c)\\&=\prod _{(i,j)}\left\{ \eta _{w}\Pr (o_i>o_j) + (1-\eta _{w})\Pr (o_i<o_j)\right\} , \end{aligned} \end{aligned}$$

where \((i,j) \in \{(a,b), (a,c), (b,c)\}\). In this case, we identify two cyclical inconsistencies, namely \(\Pr (o_a>o_b)\Pr (o_c>o_a)\Pr (o_b>o_c)\) and \(\Pr (o_b>o_a)\Pr (o_a>o_c)\Pr (o_c>o_b)\). This leads to the deficiency of pairwise aggregation models. Specifically, if the worker quality \(\eta _w\) is 0.8, the coefficients \(\eta _{w}^2(1-\eta _{w})\) and \(\eta _{w}(1-\eta _{w})^2\) of these two cyclical inconsistencies are 12.8% and 3.2%, respectively. According to Theorems 1 and 2, the cyclical inconsistency becomes more serious as k increases. In Theorem 1, we first prove that the number of cyclical inconsistencies increases with k.
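The two inconsistent combinations and their coefficients can be recovered numerically by enumerating all \(2^{C_3^2}=8\) orientations of the split pairs; the sketch below assumes \(\eta_w = 0.8\) as in the text:

```python
from itertools import product

eta = 0.8                                    # worker quality eta_w
# The three pairs split from the annotated order o_a > o_b > o_c.
pairs = [("a", "b"), ("a", "c"), ("b", "c")]

cyclic_weights = []
for pattern in product([True, False], repeat=3):
    # True: the pair keeps the annotated direction; False: it is flipped.
    wins = {o: 0 for o in "abc"}
    weight = 1.0
    for (hi, lo), keep in zip(pairs, pattern):
        weight *= eta if keep else (1.0 - eta)
        wins[hi if keep else lo] += 1
    # A 3-object tournament is cyclic iff every object wins exactly once.
    if sorted(wins.values()) == [1, 1, 1]:
        cyclic_weights.append(weight)

print(cyclic_weights)  # two cyclic combinations, weights ~0.128 and ~0.032
```

Exactly two of the eight orientations are cyclic, carrying the coefficients \(\eta_w^2(1-\eta_w)\) and \(\eta_w(1-\eta_w)^2\) from the text.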

Theorem 1

Assume that k-ary preferences are aggregated by a pairwise model, then the number of cyclical inconsistencies caused by a k-ary preference is equal to \(2^{C_k^2} - k!\).

Remark 1

The number of cyclical inconsistencies caused by a 6-ary preference is 32,048.
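Theorem 1 can be checked by brute force for small k, using the standard fact that a tournament on k nodes is transitive (cycle-free) iff its out-degree sequence is a permutation of \(\{0,1,\ldots,k-1\}\):

```python
from itertools import combinations, product
from math import comb, factorial

def num_cyclic(k):
    """Count orientations of the C(k,2) pairs that contain a cycle.

    A tournament on k nodes is transitive iff its out-degree sequence
    is a permutation of {0, 1, ..., k-1}, so we count the complement.
    """
    pairs = list(combinations(range(k), 2))
    cyclic = 0
    for pattern in product([0, 1], repeat=len(pairs)):
        out = [0] * k
        for (i, j), keep in zip(pairs, pattern):
            out[i if keep else j] += 1
        if sorted(out) != list(range(k)):
            cyclic += 1
    return cyclic

for k in (3, 4, 5, 6):
    assert num_cyclic(k) == 2 ** comb(k, 2) - factorial(k)

print(num_cyclic(6))  # 32048, matching Remark 1
```

For \(k=6\) this enumerates all \(2^{15}\) orientations in well under a second.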

However, Theorem 1 does not imply that the probability of occurrence of cyclical inconsistencies increases with k in reality. Thus, in Theorem 2, we further prove that the proportion of the probability that cyclical inconsistencies happen in a k-ary preference increases with k.

Theorem 2

Under the same assumption of Theorem 1, the likelihood \(\Pr (o_1>_{w} o_2>_{w} \cdots > _{w} o_k)\) can be decomposed into \(2^{C_k^2}\) distinct combinations of pairwise preferences. Let \(\Pr (o_1>_{w} o_2>_{w} \cdots > _{w} o_k) =\Pr _{k}(\mathbb {I}) + \Pr _{k}(\mathbb {C})\), where \(\Pr _{k}(\mathbb {I})\) is the probability of occurrence of inconsistent combinations, and \(\Pr _{k}(\mathbb {C})\) is the probability of occurrence of consistent combinations. Then \( \frac{\Pr _{k+1}(\mathbb {I})}{\Pr (o_1>_{w} o_2>_{w} \cdots> _{w} o_{k+1})}> \frac{\Pr _{k}(\mathbb {I})}{\Pr (o_1>_{w} o_2>_{w} \cdots > _{w} o_{k})}\), which means the proportion of the probability that the cyclical inconsistencies happen in a k-ary preference increases along with k.

The proofs of Theorems 1 and 2 can be found in the “Appendix”. The above analysis motivates us to integrate the Plackett–Luce model with the worker quality, which not only inherently avoids the second type of cyclical inconsistency, but also improves robustness in the k-ary crowdsourced preferences setting.

3.3 Robust Plackett–Luce model

A k-ary preference \(O_{n,w}\) has k! distinct permutations \(\{g_i (O_{n,w}) \}_{i=1}^{k!}\), which constitute a finite partition of the entire sample space (see Fig. 2). Each permutation \(g_i (O_{n,w})\) corresponds to one potential ground truth of \(O_{n,w}\). Inspired by CrowdBT, it is intuitive to leverage the original partition (see the left panel of Fig. 2) directly, and assign a denoising vector \(\mathbf {\eta _w}\) to each worker to define the k! conditional probabilities of \(O_{n,w}\) given permutation \(g_i (O_{n,w})\ \forall i \in \{1,2,\ldots , k!\}\). To be specific, let \(\eta _w^{i} = \Pr \left( O_{n,w} \mid g_i (O_{n,w})\right) \ \forall i \in \{1,2,\ldots , k!\}\), denoting the conditional probability of the observation \(O_{n,w}\) given that permutation \(g_i (O_{n,w})\) is the ground truth. Therefore, according to the law of total probability, we have:

$$\begin{aligned} \Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta })&= \Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }_w) = \sum _{i=1}^{k!}\Pr \left( O_{n,w} \mid g_i (O_{n,w})\right) f_{PL}\Big ( g_i (O_{n,w})\mid \mathbf {S}\Big )\nonumber \\&= \sum _{i=1}^{k!}\eta _w^{i} f_{PL}\Big ( g_i (O_{n,w})\mid \mathbf {S}\Big ), \end{aligned}$$
(3)

where \(\mathbf {\eta }\) denotes \(\{\mathbf {\eta }_w\}_{w=1}^W\), and \(\mathbf {\eta }_w = [\eta ^1_w, \eta ^2_w,\ldots , \eta ^{k!}_w]\) represents the denoising vector for worker w. \(\mathbf {S}= \{S_1,S_2, \ldots ,S_L\}\) denotes the implicit scores associated with all objects. The first equality holds because the preference \(O_{n,w}\) only depends on worker w’s denoising vector \(\mathbf {\eta }_w\). We use \(f_{PL}\) to represent the Plackett–Luce model [Eq. (1)].

However, the original partition used in Eq. (3) leads to the curse of permutation-space (p-space) partition. To be specific, since the number of permutations k! increases dramatically with the length of preference k, the approach in Eq. (3) requires the estimation of a parameter \(\mathbf {\eta _w}\) of length k! for each worker w, where each entry \(\eta _w^i\) corresponds to a permutation transformation function \(g_i ()\). The high dimension of the denoising vector results in a heavy burden of parameter estimation. Therefore, we need to design a clever and practical partition to avoid the curse of p-space partition. Let us further consider the different levels of workers. For an expert, most of her preferences are in accord with the ground truth and we can ignore other possible permutations in the sample space. For an amateur, she usually makes mistakes on comparable objects, which are ranked at adjacent positions in her preferences. Therefore, it is intuitive to measure the performance of each worker by the mistakes that she makes on comparable objects, instead of disparate objects. This approach is consistent with the Kendall-tau distance (Kendall 1948; Diaconis 1988), which counts the number of pairwise disagreements between two preferences. Furthermore, for a given preference \(O_{n,w}\), we observe that: (1) not all possible permutations are equally important; (2) the larger the Kendall-tau distance of a permutation to the preference \(O_{n,w}\), the lower the probability that this permutation is the ground truth. Overall, we partition the sample space based on the Kendall-tau distance (see the right panel of Fig. 2).
Specifically, let \(\eta _w^{r} = \Pr \left( O_{n,w}\mid G_r(O_{n,w})\right) \ \forall r \in \{0,1,\ldots ,C_k^2\}\), where \(G_r(O_{n,w}) = \{g_i (O_{n,w})\mid Kt\left( g_i (O_{n,w}), O_{n,w}\right) = r,\ i = 1,2,\ldots , k!\}\) denotes the set of possible permutations whose Kendall-tau distance Kt(, ) to preference \(O_{n,w}\) is r. For example, when \(r=0\), we have \(G_0(O_{n,w}) = \{O_{n,w}\}\), denoting that preference \(O_{n,w}\) accords with the ground truth. Benefiting from this approach, the length of the denoising vector \(\mathbf {\eta }_w\) is reduced from k! [Eq. (3)] to \(C_k^2 +1\) [Eq. (4)]. Finally, our proposed ROPAL for k-ary preference aggregation can be represented as follows:

$$\begin{aligned} \max _{\mathbf {S},\mathbf {\eta }}\prod _{w=1}^W \prod _{n=1}^{N_w}\sum _{r=0}^{C_k^2}\eta _w^{r} f_{PL}\Big (G_r(O_{n,w})\mid \mathbf {S}\Big ), \end{aligned}$$
(4)

where \(\mathbf {\eta }\equiv \{\mathbf {\eta }_w\}_{w=1}^W\) and \(\mathbf {\eta }_w = [{\eta }^0_w, {\eta }^1_w,\ldots ,{\eta }^{C^2_k}_w]\) represents the denoising vector for worker w. \(f_{PL}\Big (G_r(O_{n,w})\mid \mathbf {S}\Big )\) is the likelihood of the set of permutations whose Kendall-tau distance to the preference \(O_{n,w}\) is r. For example, we have \(f_{PL}\Big (G_0(O_{n,w})\mid \mathbf {S}\Big ) = f_{PL}(O_{n,w}\mid \mathbf {S})\), which can be modelled by the Plackett–Luce model [Eq. (1)].
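A minimal sketch of Eq. (4) for a single preference (scores and denoising vectors are hypothetical): the \(k!\) permutations are grouped by Kendall-tau distance into \(C_k^2+1\) cells, and a spammer's uniform denoising vector yields the constant likelihood \(1/(C_k^2+1)\):

```python
from itertools import combinations, permutations
from math import comb, exp

def kendall_tau(p, q):
    """Number of object pairs ranked in opposite order by p and q."""
    pos = {o: i for i, o in enumerate(q)}
    return sum(1 for a, b in combinations(p, 2) if pos[a] > pos[b])

def f_pl(pref, scores):
    """Plackett-Luce probability of one permutation [Eq. (1)]."""
    prob = 1.0
    for i in range(len(pref) - 1):
        prob *= exp(scores[pref[i]]) / sum(exp(scores[o]) for o in pref[i:])
    return prob

def ropal_likelihood(pref, scores, eta):
    """Eq. (4) for a single preference: sum_r eta[r] * f_PL(G_r | S)."""
    C = comb(len(pref), 2)
    assert len(eta) == C + 1
    g = [0.0] * (C + 1)                      # g[r] = f_PL(G_r | S)
    for perm in permutations(pref):
        g[kendall_tau(perm, pref)] += f_pl(perm, scores)
    return sum(e * fr for e, fr in zip(eta, g))

# Hypothetical scores and a spammer's uniform denoising vector (k = 3).
S = {0: 1.0, 1: 0.0, 2: -1.0}
spammer = [0.25, 0.25, 0.25, 0.25]           # eta^r = 1 / (C_k^2 + 1)
print(ropal_likelihood([0, 1, 2], S, spammer))  # ~0.25 for any preference
```

Setting the first entry to 1 instead recovers the plain Plackett–Luce likelihood of the annotated order, matching the expert case described below Eq. (4).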

Fig. 2

Different partitions of the sample space. Given a preference \(O_{n,w}{:}\,o_a>_w o_b>_w o_c\), its 6 distinct permutations constitute the entire sample space. Left panel: the intuitive partition inspired by CrowdBT. Right panel: the Kendall-tau based partition. For notational simplicity, we use abc to represent \(o_a>o_b>o_c\)

According to \(\eta _w^{r} = \Pr \left( O_{n,w}\mid G_r(O_{n,w})\right) \ \forall r \in \{0,1,\ldots ,C_k^2\}\), we have: (1) the larger r is, the more dissimilar the preference is to its corresponding ground truth; (2) the larger \(\eta _w^r\) is, the higher the probability that a permutation at Kendall-tau distance r from preference \(O_{n,w}\) is the ground truth.

From the perspective of workers, for an expert w whose preferences accord with the ground truth, we set \(\eta _w^{0}\approx 1\) and \(\eta _w^{r}\approx 0 \ \forall r \ge 1\), which means we trust all preferences annotated by worker w. For an amateur w, who only makes mistakes on comparable objects, we let \(\eta _w^{0}+\eta _w^{1}\approx 1\) and \(\eta _w^{r}\approx 0 \ \forall r \ge 2\). Thus, ROPAL trusts worker w with the probability based on \(\eta _w^{0}\), or corrects her preferences with the probability revealed by \(\eta _w^{1}\). For a spammer w, who annotates preferences carelessly, we define \(\eta _w^{r} = \frac{1}{C_k^2+1}\) \( \forall r\in \{0,1,\ldots ,C_k^2\}\), so the likelihood of any preference from worker w is a constant (\(\frac{1}{C_k^2+1}\)). That is to say, our model discards all the preferences annotated by worker w. For a malicious worker w who provides incorrect preferences deliberately, we observe the pattern in which she corrupts the preferences and assign a suitable value to \(\mathbf {\eta }_w\). To be specific, if she always provides inverted preferences, we set \(\eta _w^{C_k^2} \approx 1\) and ROPAL reveals the ground truth of these preferences.

From the perspective of the denoising vector, a large \(\eta _w^{0}\) indicates that worker w is considered an expert, who often provides correct preferences. For a large \(\eta _w^{1}\), ROPAL treats worker w as an amateur, who occasionally makes wrong decisions on comparable objects. Furthermore, when \(\eta _w^{r}\) \((r\ge 2)\) is large, she is most likely a malicious worker, who makes mistakes on easy tasks intentionally. In general, ROPAL explores the worker quality of each worker w, and chooses to trust her preferences based on \({\eta }^0_w\) or to reveal the possible ground truth according to \({\eta }^r_w\ \forall r \ge 1\). Therefore, ROPAL takes the worker quality into consideration, and becomes more robust than traditional PL models. According to our analysis of \(\mathbf {\eta }_w\), ROPAL can identify noisy (non-expert) workers, whose \(\eta _w^{0}\) is small.

In a nutshell, the parameter of worker quality is extended from a real number in CrowdBT to a real vector in our proposed ROPAL. In terms of worker categorization, ROPAL has an advantage in distinguishing worker quality. Specifically, CrowdBT can only distinguish noisy workers from experts, while ROPAL can make a finer-grained categorization of noisy (non-expert) workers into amateurs, spammers and malicious workers. As a result, ROPAL achieves better performance in robustness and noisy worker detection.

3.4 Online Bayesian inference

To estimate the parameters of ROPAL [Eq. (4)] efficiently, we resort to an online learning strategy. First, we extend ROPAL to a Bayesian framework and adopt the moment-matching approach to update the hyperparameters incrementally. For simplicity, we assume independent prior distributions for each parameter. Specifically, we assume that \(\mathbf {S} =(S_1, S_2, \ldots , S_L)\) is a random vector representing the scores of the objects, and each entry of \(\mathbf {S}\) is independent. Then, we introduce the standard Gaussian prior for each score \( S_l \sim N(0, 1)\ \forall l = 1,\ldots ,L\). Meanwhile, we introduce a prior for the denoising vector \(\mathbf {\eta }_w\), namely \(\mathbf {\eta }_w\sim Dir(\mathbf {\alpha }_w)\) with \(\mathbf {\alpha }_w = (\alpha _w^0, \alpha _w^1, \ldots , \alpha _w^{C_k^2})\) (Bishop 2006). How to set the prior hyperparameters \(\{\mathbf {\alpha }_w\}_{w=1}^W\) is explained in the experiment section. Here, the Bayesian version of ROPAL shares many characteristics with the Thurstonian model (Thurstone 1927; Maydeu-Olivares 1999), as both adopt the single ground-truth assumption. However, there is an essential difference: the scores of objects estimated by ROPAL are completely independent of workers, while the Thurstonian model learns a worker-specific variance \(\sigma _{l,w}^2\) for each worker w.
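The prior setup can be sketched as follows (toy sizes and hyperparameters are hypothetical); a Dirichlet draw is obtained by normalizing independent Gamma variates:

```python
import random
from math import comb

random.seed(1)
L, k, W = 6, 3, 2                 # toy sizes (hypothetical)
C = comb(k, 2)                    # C_k^2

# Standard Gaussian prior on each implicit score S_l.
S = [random.gauss(0.0, 1.0) for _ in range(L)]

def sample_dirichlet(alpha):
    """Draw eta ~ Dir(alpha) by normalizing independent Gamma variates."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

# One hyperparameter vector per worker; this choice mildly favors the
# r = 0 ("preference equals ground truth") entry.
alpha_w = [2.0] + [1.0] * C
eta = [sample_dirichlet(alpha_w) for _ in range(W)]
```

Each sampled \(\mathbf{\eta}_w\) is a valid denoising vector: it has \(C_k^2+1\) positive entries summing to one.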

Given a k-ary preference \(O_{n,w}\) annotated by worker w, according to Bayes’ theorem, the intermediateFootnote 5 posterior can be represented by the likelihood and the intermediate prior:

$$\begin{aligned} \begin{aligned} \Pr (\mathbf {S},\mathbf {\eta }\mid O_{n,w})&= \frac{\Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }_w)\Pr (\mathbf {S})\Pr (\mathbf {\eta })}{\Pr (O_{n,w})}\\&= \frac{\Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }_w) \prod _{l=1}^L N(S_l ; \mu _l, \sigma _l^2)\prod _{v=1}^W Dir(\mathbf {\eta }_v\mid \mathbf {\alpha }_v)}{\Pr (O_{n,w})}. \end{aligned} \end{aligned}$$
(5)

The first equality holds because the preference \(O_{n,w}\) only depends on worker w’s denoising vector \(\mathbf {\eta }_w\). Therefore, we have \(\Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }) \equiv \Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }_w)\).

According to Eq. (3), the posterior density \(\Pr (\mathbf {S},\mathbf {\eta }\mid O_{n,w})\) is complicated or even intractable due to the complex likelihood function \(\Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta })\). To make our online learning approach feasible, we adopt the mean-field assumption to approximate the posterior density in Eq. (5), namely:

$$\begin{aligned} \Pr (\mathbf {S},\mathbf {\eta }\mid O_{n,w}) \approx q(\mathbf {S})q(\mathbf {\eta }) = \prod _{l=1}^L N(S_l ; {\mu _l}^{new}, {\sigma _l^2}^{new})\prod _{v=1}^W Dir(\mathbf {\eta }_v\mid {\mathbf {\alpha }_v}^{new}), \end{aligned}$$

where we use the superscript \({}^{new} \) to denote the posterior parameters.

However, due to the complex likelihood function in Eq. (3), the moments \((\mathbf {\mu }, \mathbf {\sigma ^2})\) of the score \(\mathbf {S}\) cannot be computed in closed form even with the mean-field variational approximation. Inspired by the Bayesian approximation method proposed in Weng and Lin (2011), we can estimate the variational parameters \((\mathbf {\mu }, \mathbf {\sigma ^2})\) of the approximate posterior \(q(\mathbf {S})\) via differentiation instead of integration. Therefore, our method extends to any likelihood function that is twice differentiable.

Lemma 1

(Corollary 2 in Weng and Lin 2011) Let \(\mathbf {z}=(z_1, z_2, \ldots ,z_L)\) be a random vector, where each entry is independent and \(z_l \sim N(0,1)\), \(l = 1, 2, \ldots ,L\). Suppose that \(f(O\mid \mathbf {z} )\) is the likelihood function of observation O and f is almost everywhere twice differentiable. Then the mean and the variance of the posterior density \(\Pr (\mathbf {z}\mid O)\) given the observation O can be approximated as

$$\begin{aligned} E[\mathbf {z}]&\approx E\left[ \frac{\nabla _{\mathbf {z}} f(O\mid \mathbf {z} )}{f(O\mid \mathbf {z} )}\right] , \end{aligned}$$
(6a)
$$\begin{aligned} E[z_pz_q]&\approx \mathbf {1}_{pq} + E\left[ \frac{\nabla _{\mathbf {z}}^2 f(O\mid \mathbf {z} )}{f(O\mid \mathbf {z} )}\right] _{pq}, \quad p,q = 1,\ldots ,L \end{aligned}$$
(6b)

where \(\mathbf {1}_{pq} = 1\) if \(p = q\) and 0 otherwise.

Remark 2

Suppose that each object l in a ranking system is parametrized by the score \(S_l\), and assume that the prior distribution of \(S_l\) is \(N(\mu _l, \sigma _l^2)\). Upon the completion of one update, let O denote a k-ary preference and \(\mathbf {z}=(z_1, z_2, \ldots ,z_L)\), where \(z_l = \frac{S_l - \mu _l}{\sigma _l}\sim N(0,1)\ \forall l = 1,\ldots , L\), and L is the number of objects. \(l(\mathbf {z} )\) is the likelihood function of preference O and is twice differentiable. Thus, the posterior moments of \(S_l\) given the k-ary preference O can be approximated as

$$\begin{aligned}&E[\mu _l] \approx {\mu _l} + \sigma _l E[z_l] \approx {\mu _l} + \sigma _l E\left[ \frac{\partial l(\mathbf {z})/\partial z_l}{l(\mathbf {z})}\right] , \end{aligned}$$
(7a)
$$\begin{aligned}&E[\sigma _l^2] \approx \sigma _l^2Var[z_l]=\sigma _l^2\left( E(z_l^2) - E^2(z_l)\right) \approx \sigma _l^2 \nonumber \\&\quad \left( 1+E\left[ \frac{\partial ^2 l(\mathbf {z})/\partial ^2 z_l}{l(\mathbf {z})}\right] -E\left[ \frac{\partial l(\mathbf {z})/\partial z_l}{l(\mathbf {z})}\right] ^2\right) , \end{aligned}$$
(7b)

where \(l= 1,2,\ldots ,L\).
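As a quick sanity check on Remark 2 (illustrative only, not part of ROPAL), the sketch below applies Eqs. (7a)–(7b), with the derivative ratios evaluated at \(z=0\) by central finite differences, to a Gaussian observation model whose exact conjugate posterior is known. The values of \(\mu \), \(\sigma ^2\), the noise variance \(b^2\) and the observation \(o\) are made up:

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Prior S ~ N(mu, s2); one observation o of S corrupted by N(0, b2) noise.
mu, s2, b2, o = 0.0, 0.1, 1.0, 1.0
sigma = math.sqrt(s2)

def lik(z):
    """Likelihood of o given S = mu + sigma * z (the reparametrization of Remark 2)."""
    return normal_pdf(o, mu + sigma * z, b2)

h = 1e-5
ratio = lambda z: ((lik(z + h) - lik(z - h)) / (2 * h)) / lik(z)

mu_new = mu + sigma * ratio(0.0)          # Eq. (7a), evaluated at z = 0
dr = (ratio(h) - ratio(-h)) / (2 * h)     # d/dz of (dl/dz)/l, at z = 0
s2_new = s2 * (1.0 + dr)                  # Eq. (7b), evaluated at z = 0

# Exact conjugate-Gaussian posterior, for comparison
mu_exact = mu + s2 * (o - mu) / (s2 + b2)   # 0.1/1.1, approx 0.0909
s2_exact = s2 * b2 / (s2 + b2)              # 0.1/1.1, approx 0.0909
```

With the prior variance small relative to the noise variance, the derivative-based updates agree with the exact posterior to first order, which is exactly the regime the evaluation at \(z=0\) assumes.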

For a k-ary preference \(O_{n,w}{:}\,o_1>_{w} o_2>_{w}\cdots >_{w}o_k\) annotated by worker w, we first update the moments of the score \(\mathbf {S}\), and then those of the denoising vector \(\mathbf {\eta }_w\). To update \(\mathbf {S}\), we integrate \(\mathbf {\eta }\) out to obtain the intermediate marginalized likelihood \(l(\mathbf {S})\):

$$\begin{aligned} \begin{aligned} l(\mathbf {S})&= \idotsint \Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }_w)\prod _{ v=1}^W Dir(\mathbf {\eta }_ {v}\mid \mathbf {\alpha }_ {v})d\mathbf {\eta }_1 \ldots d\mathbf {\eta }_W\\&= \int \Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }_w)Dir(\mathbf {\eta }_ {w}\mid \mathbf {\alpha }_ {w})d\mathbf {\eta }_w. \end{aligned} \end{aligned}$$
(8)

The equality holds because each worker completes the annotation process independently; hence the denoising vectors \(\mathbf {\eta }_v\) with \(v \ne w\) can be integrated out. According to the definition of the PL model in Eq. (1) and the likelihood in Eq. (3), we have:

$$\begin{aligned} l(\mathbf {S}) = \sum _{r=0}^{C_k^2}E_{Dir(\mathbf {\eta }_w\mid \mathbf {\alpha }_w)}\left[ \eta _w^{r}\right] f_{PL}\Big (G_r(O_{n,w})\mid \mathbf {S}\Big )= \sum _{r=0}^{C_k^2}\frac{\alpha _{w}^{r}}{\sum _{t=0}^{C_k^2}\alpha _{w}^{t}} f_{PL}\Big (G_r(O_{n,w})\mid \mathbf {S}\Big ), \end{aligned}$$
(9)

where \(G_r(O_{n,w}) = \{g_i (O_{n,w})\mid Kt\left( g_i (O_{n,w}),O_{n,w}\right) = r,\ i = 1,2,\ldots , k!\}\), and \(f_{PL}\) represents the Plackett–Luce model [Eq. (1)].
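To make \(G_r\) concrete, the following sketch enumerates the \(k! = 6\) permutations of a ternary preference and groups them by Kendall-tau distance; it assumes (consistent with the spammer argument in Sect. 1, where a uniform denoising vector makes the likelihood constant) that \(f_{PL}(G_r \mid \mathbf {S})\) sums the PL probabilities of the permutations in \(G_r\), so that the group likelihoods sum to one over all \(k!\) permutations:

```python
import math
from itertools import permutations

def kendall_tau(p, q):
    """Number of discordant object pairs between rankings p and q."""
    pos = {o: i for i, o in enumerate(q)}
    return sum(pos[p[i]] > pos[p[j]]
               for i in range(len(p)) for j in range(i + 1, len(p)))

def f_pl(ranking, scores):
    """Plackett-Luce probability of a full ranking given scores (Eq. (1))."""
    vals = [math.exp(scores[o]) for o in ranking]
    prob = 1.0
    for i in range(len(vals)):
        prob *= vals[i] / sum(vals[i:])
    return prob

O = ('a', 'b', 'c')   # an observed 3-ary preference a > b > c
G = {}                # G_r: permutations at Kendall-tau distance r from O
for g in permutations(O):
    G.setdefault(kendall_tau(g, O), []).append(g)

# Group sizes for k = 3: one permutation at r = 0 and r = 3, two at r = 1 and r = 2.
sizes = {r: len(gs) for r, gs in G.items()}
```

For \(k=3\) this yields the \(C_k^2+1 = 4\) groups over which the denoising vector is defined, with the identity permutation alone at \(r=0\) and the fully reversed one alone at \(r=3\).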

Let \(\mathbf {z}=(z_1, z_2, \ldots ,z_L)\), where \(z_l = \frac{S_l - \mu _l}{\sigma _l}\sim N(0,1)\ \forall l = 1,\ldots , L\), and L is the number of objects. It is notable in Eq. (9) that \(l(\mathbf {S})\) only depends on the scores of the objects in preference \(O_{n,w}\) during each update. Therefore, under our mean-field assumption, only the moments of the objects in preference \(O_{n,w}\) need to be updated. According to Eq. (7a) in Remark 2, we update \(\mu _{o_i} \ \forall o_i \in \{o_1, o_2, \ldots , o_k\}\) as follows:

$$\begin{aligned} \mu _{o_i}^{new} \approx {\mu _{o_i}} + \sigma _{o_i} E\left[ \frac{\partial l(\mathbf {S})/\partial z_{o_i}}{l(\mathbf {S})}\right] \approx {\mu _{o_i}} + \sigma ^2_{o_i} \frac{\partial l(\mathbf {z})/\partial z_{o_i}}{l(\mathbf {z})}\vert _{\mathbf {z}=\mathbf {0}}, \end{aligned}$$
(10)

where we set \(\mathbf {z}=\mathbf {0}\) so that \(\mathbf {S}\) is replaced by \(\mathbf {\mu }\). Such an approximation is reasonable, as we expect the posterior density of \(\mathbf {S}\) to be concentrated around \(\mathbf {\mu }\) (Weng and Lin 2011).

According to Eq. (7b) in Remark 2, we update \(\sigma _{o_i}^2 \ \forall o_i \in \{o_1, o_2, \ldots , o_k\}\) as follows:

$$\begin{aligned} (\sigma _{o_i}^2)^{new}\approx & {} \sigma _{o_i}^2 \left( 1+E\left[ \frac{\partial ^2 l(\mathbf {S})/\partial z_{o_i}^2}{l(\mathbf {S})}\right] -E\left[ \frac{\partial l(\mathbf {S})/\partial z_{o_i}}{l(\mathbf {S})}\right] ^2\right) \nonumber \\\approx & {} \sigma _{o_i}^2 \max \left( {1+ \frac{\partial }{\partial z_{o_i}} \left( \frac{\partial l(\mathbf {S})/\partial z_{o_i}}{l(\mathbf {S})}\right) \vert _{\mathbf {z}=\mathbf {0}},\kappa _1 } \right) , \end{aligned}$$
(11)

where \(\kappa _1\) is a small positive value to ensure a positive \(\sigma ^2_{o_i}\).
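Putting Eqs. (9)–(11) together, one score update could be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes \(k=3\), that \(f_{PL}(G_r \mid \mathbf {S})\) sums the PL probabilities over the permutations in \(G_r\), and it takes the derivative ratios numerically with respect to the raw score \(S_{o_i}\) (which equals the z-derivative of Remark 2 up to a factor of \(\sigma _{o_i}\)). The example \(\mathbf {\alpha }_w\) and all variable names are illustrative:

```python
import math
from itertools import permutations

def kendall_tau(p, q):
    """Number of discordant object pairs between rankings p and q."""
    pos = {o: i for i, o in enumerate(q)}
    return sum(pos[p[i]] > pos[p[j]]
               for i in range(len(p)) for j in range(i + 1, len(p)))

def f_pl(ranking, scores):
    """Plackett-Luce probability of a full ranking given scores (Eq. (1))."""
    vals = [math.exp(scores[o]) for o in ranking]
    prob = 1.0
    for i in range(len(vals)):
        prob *= vals[i] / sum(vals[i:])
    return prob

def marginal_lik(pref, scores, alpha):
    """Eq. (9): l(S) = sum_r (alpha_r / sum_t alpha_t) f_PL(G_r | S)."""
    total = sum(alpha)
    return sum(alpha[kendall_tau(g, pref)] / total * f_pl(g, scores)
               for g in permutations(pref))

def update_score(pref, mu, s2, alpha, o, h=1e-4, kappa1=1e-4):
    """Moment update for one object per Eqs. (10)-(11), with the derivative
    ratios of l(S) evaluated numerically at S = mu."""
    def lik(s):
        sc = dict(mu)
        sc[o] = s
        return marginal_lik(pref, sc, alpha)
    s0 = mu[o]
    ratio = lambda s: ((lik(s + h) - lik(s - h)) / (2 * h)) / lik(s)
    r0 = ratio(s0)
    dr = (ratio(s0 + h) - ratio(s0 - h)) / (2 * h)
    mu_new = s0 + s2[o] * r0                          # Eq. (10)
    s2_new = s2[o] * max(1.0 + s2[o] * dr, kappa1)    # Eq. (11)
    return mu_new, s2_new

# One 3-ary preference a > b > c from a worker with an expert-like alpha_w
pref = ('a', 'b', 'c')
mu = {'a': 0.0, 'b': 0.0, 'c': 0.0}
s2 = {'a': 1.0, 'b': 1.0, 'c': 1.0}
alpha_w = [24.0, 3.0, 2.0, 1.0]   # entries indexed by Kendall-tau distance r
mu_a, s2_a = update_score(pref, mu, s2, alpha_w, 'a')
mu_c, s2_c = update_score(pref, mu, s2, alpha_w, 'c')
```

As expected, the update pushes the score of the top-ranked object \(a\) up and that of the bottom-ranked object \(c\) down, since the dominant first entry of \(\mathbf {\alpha }_w\) puts most of the mixture weight on the observed ordering.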

We now update the moments of the denoising vector \(\mathbf {\eta }_w\). To do so, we integrate \(\mathbf {S}\) out to obtain the intermediate marginalized likelihood \(l(\mathbf {\eta })\):

$$\begin{aligned} \begin{aligned} l(\mathbf {\eta })&= \idotsint \Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }_w)\prod _{l=1}^L N(S_l ; \mu _l, \sigma _l^2)dS_1\ldots dS_L\\&= \idotsint \Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }_w)\prod _{i=1}^k N(S_{o_i} ; \mu _{o_i}, \sigma _{o_i}^2)dS_{o_1}\ldots dS_{o_k}. \end{aligned} \end{aligned}$$

The equality holds due to our independence assumption on the scores \(\mathbf {S}\). According to the definition of the likelihood in Eq. (3), we have:

$$\begin{aligned} \begin{aligned} l(\mathbf {\eta })&= E_{N(S_{o_1},S_{o_2},\ldots ,S_{o_k})}\left[ \Pr (O_{n,w}\mid \mathbf {S},\mathbf {\eta }_w)\right] = \sum _{r=0}^{C_k^2}\eta _w^{r}R_r, \end{aligned} \end{aligned}$$
(12)

where \(R_r = E\left[ f_{PL}\Big (G_r(O_{n,w})\mid \mathbf {S}\Big )\right] \). \(f_{PL}\Big (G_r(O_{n,w})\mid \mathbf {S}\Big )\) is the likelihood of the permutations \(g_i(O_{n,w}) \ \forall i \in \{1,2,\ldots , k!\}\) with the Kendall-tau distance to the preference \(O_{n,w}\) being r. We use \(f_{PL}\) to represent the Plackett–Luce model [Eq. (1)].

Furthermore, \(R_r\) can be approximated with the second-order Taylor approximation of \(f_{PL}\Big (G_r(O_{n,w})\mid \mathbf {S}\Big )\) at \(\mu _{o_i} \ \forall o_i \in \{o_1, o_2, \ldots , o_k\}\). Specifically,

$$\begin{aligned} R_r= & {} E\left[ f_{PL}\Big (G_r(O_{n,w})\mid \mathbf {S}\Big )\right] \\\approx & {} f_{PL}\big (G_r(O_{n,w})\mid \mathbf {\mu }\big ) + \frac{1}{2}\left( \sum _{t=1}^{k} \sigma _{o_t}^2\frac{\partial ^2}{\partial S_{o_t}^2}\Big (f_{PL}\big (G_r(O_{n,w})\mid \mathbf {S}\big )\Big )\right) \vert _{\mathbf {S}=\mathbf {\mu }}, \end{aligned}$$
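The Taylor step above can be checked numerically. The sketch below (illustrative; it assumes \(k=3\), that \(f_{PL}(G_r \mid \mathbf {S})\) sums over the permutations in \(G_r\), and made-up values of \(\mathbf {\mu }\) and \(\mathbf {\sigma }^2\)) compares the second-order Taylor estimate of \(R_r\) with a Monte Carlo estimate of the same expectation:

```python
import math
import random
from itertools import permutations

def kendall_tau(p, q):
    pos = {o: i for i, o in enumerate(q)}
    return sum(pos[p[i]] > pos[p[j]]
               for i in range(len(p)) for j in range(i + 1, len(p)))

def f_pl_group(pref, r, scores):
    """Sum of Plackett-Luce probabilities of permutations at distance r."""
    total = 0.0
    for g in permutations(pref):
        if kendall_tau(g, pref) == r:
            vals = [math.exp(scores[o]) for o in g]
            p = 1.0
            for i in range(len(vals)):
                p *= vals[i] / sum(vals[i:])
            total += p
    return total

pref = ('a', 'b', 'c')
mu = {'a': 0.6, 'b': 0.0, 'c': -0.6}      # illustrative posterior means
s2 = {'a': 0.09, 'b': 0.09, 'c': 0.09}    # illustrative posterior variances

def taylor_R(r, h=1e-3):
    """Second-order Taylor estimate of R_r = E[f_PL(G_r | S)] around S = mu,
    with second derivatives taken by central finite differences."""
    base = f_pl_group(pref, r, mu)
    corr = 0.0
    for o in pref:
        up, dn = dict(mu), dict(mu)
        up[o] += h
        dn[o] -= h
        d2 = (f_pl_group(pref, r, up) - 2 * base + f_pl_group(pref, r, dn)) / h ** 2
        corr += 0.5 * s2[o] * d2
    return base + corr

random.seed(0)
def mc_R(r, n=5000):
    """Monte Carlo estimate of the same expectation, for comparison."""
    acc = 0.0
    for _ in range(n):
        s = {o: random.gauss(mu[o], math.sqrt(s2[o])) for o in pref}
        acc += f_pl_group(pref, r, s)
    return acc / n
```

With small variances the two estimates agree closely, and the four \(R_r\) values sum to one (before the normalization below), since the group likelihoods partition the \(k!\) permutations.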

Then we apply the following nonnegative normalization to \(R_r\):

$$\begin{aligned} R_r= \frac{\max (R_r,\kappa _2)}{\sum _{t=0}^{C_k^2} \max (R_t,\kappa _2) } \qquad r = 0,1,\ldots ,C_k^2, \end{aligned}$$

where \(\kappa _2\) is a small positive value to ensure a positive \(R_r\).

Let \(R = \Pr (O_{n,w})\) denote the likelihood of the k-ary preference; then we have:

$$\begin{aligned} R&= \Pr (O_{n,w})=\idotsint l(\mathbf {\eta })\prod _{ v=1}^W Dir(\mathbf {\eta }_ {v}\mid \mathbf {\alpha }_ {v})d\mathbf {\eta }_1\ldots d\mathbf {\eta }_W \nonumber \\&=\int l(\mathbf {\eta }_w) Dir(\mathbf {\eta }_ {w}\mid \mathbf {\alpha }_ {w})d\mathbf {\eta }_w\nonumber \\&= R_0E_{Dir(\mathbf {\eta }_w\mid \mathbf {\alpha }_w)}\left[ \eta _w^{0}\right] + R_1E_{Dir(\mathbf {\eta }_w\mid \mathbf {\alpha }_w)}\left[ \eta _w^{1}\right] + \cdots + R_{C_k^2}E_{Dir(\mathbf {\eta }_w\mid \mathbf {\alpha }_w)}\left[ \eta _w^{C_k^2}\right] \nonumber \\&=\frac{R_0\alpha _{w}^0+ R_1\alpha _{w}^1 + \cdots + R_{C_k^2}\alpha _{w}^{C_k^2}}{\sum _{t=0}^{C_k^2} \alpha _w^t}. \end{aligned}$$
(13)

As can be seen in Eq. (12), \(l(\mathbf {\eta })\) is only associated with \(\mathbf {\eta }_w\). Thus, \(l(\mathbf {\eta })\) degenerates to \(l(\mathbf {\eta }_w)\) in Eq. (13). According to Bayes' theorem, we have:

$$\begin{aligned} \Pr (\mathbf {\eta _w}\mid O_{n,w}) = \frac{l(\mathbf {\eta }_w)Dir(\mathbf {\eta }_w\mid \mathbf {\alpha }_w)}{\Pr (O_{n,w})}= \frac{l(\mathbf {\eta }_w)Dir(\mathbf {\eta }_w\mid \mathbf {\alpha }_w)}{R}. \end{aligned}$$

Given the above, only the moments of the denoising vector \(\mathbf {\eta }_w\) need to be updated during the current update. We compute \(E\left[ \eta _w^{r}\right] \) and \(E\left[ {(\eta _w^r)}^2\right] \) for \(r=0,\ldots ,C_k^2\) as follows:

$$\begin{aligned} E\left[ \eta _w^{r}\right]&= \int \eta _w^{r}\Pr (\mathbf {\eta }_w\mid O_{n,w})d\mathbf {\eta }_w = \frac{1}{R}\int \eta _w^{r}l(\mathbf {\eta }_w)Dir(\mathbf {\eta }_w\mid \mathbf {\alpha }_w)d\mathbf {\eta }_w\nonumber \\&=\frac{\alpha _w^{r}\left( \sum _{p=0}^{C_k^2}(R_p\alpha _w^{p}) + R_{r}\right) }{R\left( \sum _{q=0}^{C_k^2}\alpha _w^{q}+1\right) \left( \sum _{q=0}^{C_k^2}\alpha _w^{q}\right) },\end{aligned}$$
(14a)
$$\begin{aligned} E\left[ {(\eta _w^r)}^2\right]&= \int {(\eta _w^{r})}^2\Pr (\mathbf {\eta }_w\mid O_{n,w})d\mathbf {\eta }_w =\frac{1}{R}\int {(\eta _w^{r})}^2l(\mathbf {\eta }_w)Dir(\mathbf {\eta }_w\mid \mathbf {\alpha }_w)d\mathbf {\eta }_w\nonumber \\&=\frac{\alpha _w^{r}(\alpha _w^{r} + 1)\left( \sum _{p=0}^{C_k^2}(R_p\alpha _w^{p}) + 2R_{r}\right) }{R\left( \sum _{q=0}^{C_k^2}\alpha _w^{q}+2\right) \left( \sum _{q=0}^{C_k^2}\alpha _w^{q}+1\right) \left( \sum _{q=0}^{C_k^2}\alpha _w^{q}\right) }. \end{aligned}$$
(14b)

We then update the hyperparameters \(\alpha _w^{r}\) for denoising vector \(\mathbf {\eta }_w\):

$$\begin{aligned} {\left( \alpha _w^r\right) }^{new} = \frac{\left( E\left[ \eta _w^{r}\right] - E\left[ {(\eta _w^r)}^2\right] \right) E\left[ \eta _w^{r}\right] }{E\left[ {(\eta _w^r)}^2\right] -\left( E\left[ \eta _w^{r}\right] \right) ^2}. \end{aligned}$$
(15)
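Eq. (15) is the standard moment-matching step for a Dirichlet: given the first two marginal moments of \(\eta _w^r\), it recovers the pseudo-count \(\alpha _w^r\). A quick round-trip check (illustrative; it feeds in the exact Beta marginal moments of a Dirichlet rather than the approximations from Eqs. (14a)–(14b)):

```python
def alpha_from_moments(m1, m2):
    """Eq. (15): recover the Dirichlet pseudo-count alpha_w^r from the first
    two moments of the marginal distribution of eta_w^r."""
    return (m1 - m2) * m1 / (m2 - m1 ** 2)

# Round trip: the marginal of entry r of Dir(alpha) is Beta(a_r, a0 - a_r),
# with E[x] = a_r / a0 and E[x^2] = a_r (a_r + 1) / (a0 (a0 + 1)).
alpha = [24.0, 3.0, 2.0, 1.0]   # an illustrative hyperparameter vector
a0 = sum(alpha)
recovered = [alpha_from_moments(a_r / a0, a_r * (a_r + 1) / (a0 * (a0 + 1)))
             for a_r in alpha]  # equals alpha up to floating point
```

Plugging exact moments back in recovers \(\alpha _w^r\) exactly, which confirms that Eq. (15) inverts the Dirichlet moment map.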

As a result of the closed-form update solutions, our online Bayesian inference can handle massive k-ary preferences. Our online Bayesian inference for scalable ROPAL is given in Algorithm 1.


4 Experiments

In this section, we perform experiments on large-scale simulated datasets to preliminarily validate the robustness of ROPAL (Sect. 4.1). Then, we conduct experiments on two real-world datasets, namely “PeerGrading” and “BabyFace”, to show the superiority of ROPAL in robustness and noisy-worker detection (Sect. 4.2). Note that, for the k-ary crowdsourced preference setting, we reveal the deficiency of pairwise CrowdBT in Sect. 3.2 and propose the (general) k-ary ROPAL to address this deficiency in Sect. 3.3. In this paper, we showcase the performance of ROPAL by setting \(k = 3\); theoretical details for \(k = 3\) can be found in the “Appendix”.

4.1 Simulated experiments

Datasets & metrics We generate simulated datasets in a similar way to the description in CrowdBT (Chen et al. 2013). Assume that there are L objects with ground-truth scores from 1 to L. Each k-ary subset, composed of k \((k \ll L)\) randomly chosen objects, is labeled by W workers with different worker qualities \(\{\mathbf {\eta }_w\} _{w=1}^{W}\) following a Dirichlet distribution \(Dir(\mathbf {\eta }_w\mid \mathbf {\alpha }_w)\). When all worker qualities are simulated from the same prior distribution, the average quality of workers is controlled by the hyperparameters \(\mathbf {\alpha }_0\) while diversity among workers is retained. Therefore, we control the quality of the data by choosing proper hyperparameters \(\mathbf {\alpha }_0\). To compare with other models, we adopt the Wilcoxon–Mann–Whitney statistic (Yan et al. 2003) to evaluate the accuracy \(\frac{ \sum _{i,j}\mathbf {1}(r_i>r_j) \wedge \mathbf {1}(\mu _i>\mu _j) }{ \sum _{i,j}\mathbf {1}(r_i>r_j) }\) of different models, where \(r_i>r_j\) represents the ground-truth preference between objects i and j, and \(\mu _i\) is the expected score of the ith object.
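The accuracy metric above counts correctly ordered object pairs. A minimal sketch, with made-up ground-truth and estimated scores:

```python
def wmw_accuracy(truth, mu):
    """Wilcoxon-Mann-Whitney pairwise accuracy: the fraction of object pairs
    ordered by the ground truth that the estimated scores order the same way."""
    pairs = [(i, j) for i in range(len(truth)) for j in range(len(truth))
             if truth[i] > truth[j]]
    hits = sum(1 for i, j in pairs if mu[i] > mu[j])
    return hits / len(pairs)

# Hypothetical example: 5 objects with ground-truth scores 1..5; the estimated
# scores swap the order of objects 2 and 3, so 9 of the 10 pairs are correct.
truth = [1, 2, 3, 4, 5]
est = [0.1, 0.5, 1.2, 1.0, 2.0]
acc = wmw_accuracy(truth, est)   # 0.9
```

Because the metric only compares pairwise orderings, it is invariant to any monotone rescaling of the estimated scores, which makes it model-independent as claimed later in this section.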

Parameter initialization For the hyperparameters \((\mu _l,\sigma ^2_l)\) of score \(S_l\), we assign a standard normal prior, i.e. \(S_l\sim N(0,1)\ \forall l \in \{1,2,\ldots , L \}\), in all experiments. For the hyperparameter \(\mathbf {\alpha }_w\) of the denoising vector \(\mathbf {\eta }_w\) \(\forall w \in \{1,2,\ldots , W \}\) in ROPAL, we initialize each \(\mathbf {\alpha }_w\) with 7 gold tasks with known true preferences. The same initialization method also applies to the hyperparameter \((\alpha _w, \beta _w)\) of worker quality \(\eta _w\) \(\forall w \in \{1,2,\ldots , W\}\) in CrowdBT. We set \(L=1000\), \(W = 600\), and assign each worker \(N_w \equiv 1000\) tasks. The number of generated preferences reaches \(W \times N_w = 6\times 10^5\), which cannot be handled by general models with a batch learning strategy. In addition, we set \(T = 5\) in Algorithm 1 for all simulated experiments.

Robustness First, we showcase the robustness of ROPAL on four simulated datasets. Specifically, according to our analysis of the four kinds of workers, the hyperparameter \(\mathbf {\alpha }_0\) is set to (24, 3, 2, 1), (5, 4, 3, 2), (5, 5, 4, 4) and (2, 5, 4, 2), which yields the “Expert”, “Amateur”, “Spammer” and “Malicious” settings, respectively. In particular, the hyperparameter \(\mathbf {\alpha }_0=(24,3,2,1)\) means that the probability that preferences from each worker accord with the ground truth is almost \(80\%\). The four settings also ensure a descending order of the average quality \(E(\eta _w^{0})\) across the four simulated datasets. From Fig. 3, we make the following observations: (1) under the Expert setting, ROPAL has the best accuracy (above \(84.5\%\)) due to the highest quality of the dataset; (2) the accuracy of ROPAL decreases by \(10\%\) on the Amateur setting and further decreases to below \(65\%\) on the Spammer setting, showing that the quality of the dataset affects the accuracy of ROPAL; (3) it is interesting to note that ROPAL on the Malicious setting obtains comparable or even better accuracy than on the Spammer setting. This is because ROPAL tends to discard the preferences from spammers, whereas for malicious workers, ROPAL tries to explore the patterns in which they corrupt the preferences and thereby gathers more information from the dataset for inference; and (4) it is notable that the accuracy on all settings reaches a high level with less than \(50\%\) of the data, and then fluctuates slightly with subsequent data. Therefore, for a large dataset, the accuracy will not be improved by running multiple passes through the dataset.

Fig. 3

To showcase the robustness of ROPAL preliminarily, we provide the accuracy (%) on large-scale simulated datasets, which include “Expert”, “Amateur”, “Spammer” and “Malicious” settings, respectively. \(50\%\) means we run the experiment with half of the data, while \(500\%\) denotes that we run the experiment on the same dataset five times with a random order

Second, we further verify the robustness of ROPAL by comparing it with well-known models: (1) Bradley–Terry (BT), (2) Plackett–Luce (PL), and (3) CrowdBT (Chen et al. 2013). We modify the above models for a fair comparison. (1) BT and CrowdBT cannot model k-ary preference aggregation directly; therefore, we split each k-ary preference into \(k(k-1)/2\) (full-pair) pairwise preferences. (2) Since our model is updated with an online strategy, we also implement BT and PL with an online strategy, which is much more efficient than a batch learning strategy.
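The full-pair splitting used for BT and CrowdBT can be sketched as follows (an illustrative helper, not the authors' code):

```python
from itertools import combinations

def split_to_pairs(pref):
    """Split a k-ary preference o1 > o2 > ... > ok into its k(k-1)/2
    pairwise preferences, as the BT-based models require."""
    return [(pref[i], pref[j]) for i, j in combinations(range(len(pref)), 2)]

pairs = split_to_pairs(('a', 'b', 'c'))   # [('a','b'), ('a','c'), ('b','c')]
```

Note that once preferences are flattened to independent pairs like this, nothing prevents the aggregated pairs from different workers being mutually inconsistent (e.g. \(a > b\), \(b > c\) and \(c > a\)), which is exactly the cyclical-inconsistency issue discussed earlier.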

As we can see in Fig. 4: (1) on all settings, ROPAL achieves the highest accuracy among the baselines; (2) on the Expert setting, all models achieve accuracy above \(84\%\), benefiting from the high quality of the dataset; (3) on the Amateur, Spammer and Malicious settings, the advantage of ROPAL and CrowdBT becomes noticeable. Since ROPAL and CrowdBT consider worker quality, they are more robust to noisy crowdsourced preferences; (4) all PL-based models show minor improvements over the corresponding BT-based models, because BT-based models split k-ary crowdsourced preferences and cyclical inconsistency occurs; and (5) on all settings, the accuracy of all models reaches its highest value with about \(50\%\) of the data, and then remains stable or decreases slowly with subsequent data. As the metric we adopt to evaluate accuracy is independent of all models, the decrease in accuracy indicates that the four models overfit on the Expert and Spammer settings. Therefore, for a small dataset, performance may improve by running multiple passes through the data (as verified by our real-world experiment), while for large datasets we run through the dataset only once to avoid overfitting.

Fig. 4

To verify the robustness of ROPAL further, we provide the accuracy (%) comparison between ROPAL and the well-known models OnlineBT, OnlinePL and CrowdBT. \(50\%\) means we run the experiment with half of the data, while \(500\%\) denotes that we run the experiment on the same dataset five times with a random order. Note that, on the Malicious setting, the accuracy of OnlinePL and OnlineBT is almost the same, thus their curves overlap

Computational cost Furthermore, we compare the computational costs of the four models theoretically in Table 2, which is verified by the empirical results on the Expert setting (see Fig. 5). Note that: (1) the computational costs of all models are independent of the settings; and (2) since we implement the inference with an online-updating strategy, all models can be updated with one k-ary preference at a time. From Table 2, we clearly see that two major processes contribute to the overall computational cost, namely splitting and updating. The theoretical analysis in Table 2 is consistent with our observations in Fig. 5: (1) the computational cost of CrowdBT is much higher than that of the other models, for two reasons given in Table 2. First, CrowdBT splits k-ary preferences into \(C_k^2\) pairwise preferences as input. Second, unlike OnlineBT, CrowdBT also needs to update worker quality; and (2) the computational cost of OnlineBT is slightly higher than that of ROPAL. From Table 2, we see that although the updating process dominates the overall computational cost, the splitting process in OnlineBT still affects the overall cost.

Table 2 Computational cost of ROPAL and other baselines
Fig. 5

To verify the efficiency of ROPAL, we measure the time taken by the four models when we conduct the experiment on the Expert setting. The number of preferences reaches \(W \times N_w \times T = 3\times 10^6\). Experiments are implemented on a Linux cluster with a 3.47 GHz CPU and 11.7 GB memory

4.2 Real-world experiments

In this section, we first conduct experiments on the “PeerGrading” dataset to verify the robustness of ROPAL in real-world situations. More importantly, we then perform experiments on the “BabyFace” dataset to show the superiority of ROPAL in terms of robustness and noisy worker detection.

4.2.1 PeerGrading

Datasets We employ two peer grading datasets introduced by Raman and Joachims (2014), uniformly called “PeerGrading”. The PeerGrading data were collected as part of a senior-undergraduate and masters-level class, and consist of the Poster (PO) dataset and the Final Report (FR) dataset. Specifically, the PO dataset involves 42 assignments (objects), 148 students (workers) and 7 teaching assistants (TAs); the FR dataset involves 44 assignments, 153 students and 9 TAs. This class size is attractive, since it is large enough to collect a substantial number of peer grades while still allowing TA grading to serve as the ground truth.

Parameter initialization For the hyperparameter \(\mathbf {\alpha }_w\) in ROPAL, we have no access to gold preferences because the average number of preferences annotated by each worker is too small (6 for PO and 2 for FR). However, as demonstrated in Raman and Joachims (2014), most students in the PO and FR datasets are high-quality (expert) workers. According to our analysis in Sect. 3.3, for an expert worker w, the first entry of her denoising vector dominates the distribution. Since \(E[\eta _w^0] = \frac{\alpha _w^0}{ \sum _r \alpha _w^r}\), a large \(E[\eta _w^{0}]\) indicates that worker w is an expert who often provides correct preferences, and vice versa. As for the hyperparameter \((\alpha _w, \beta _w)\) in CrowdBT, we have \(E[\eta _w] = \frac{\alpha _w}{\alpha _w+ \beta _w}\) for worker w; that is, a large \(E[\eta _w]\) indicates high accuracy of the preferences annotated by worker w, and vice versa. For a fair comparison, we set \(\mathbf {\alpha }_w = [24,3,2,1]\) in ROPAL and \(\alpha _w=4, \beta _w=1\) \( \forall w \in \{1,2,\ldots , W\}\) in CrowdBT, namely \(E[ \eta _w^0] = 0.8\) and \(E[\eta _w] = 0.8\) for all workers in ROPAL and CrowdBT, respectively. This initialization is consistent with our assumption that most students in the two peer grading datasets are expert workers.

Robustness We preliminarily explore the robustness of ROPAL in real-world situations. As the PeerGrading dataset is small compared with our synthetic datasets, we first explore the effect of T (Footnote 6) on the PeerGrading dataset by setting T to 1 and 10 in Algorithm 1, respectively. Then, we repeat the experiment \(10^3\) times and collect the accuracy (mean and standard deviation) of ROPAL and the other baselines in Table 3.

Table 3 To preliminarily verify the robustness of ROPAL in real-world situations, we provide the accuracy (%) on the PeerGrading dataset (PO and FR datasets)

From Table 3, we observe that: (1) ROPAL achieves higher or comparable accuracy in all experiments; (2) on the PO dataset, the accuracy of ROPAL and the other baselines remains stable when we run the experiment multiple times (\(T=10\)). As the PO dataset is relatively large, this is consistent with the results of our synthetic experiments; and (3) since the FR dataset is smaller than the PO dataset, the accuracy of PL-based models increases noticeably when we run through the FR dataset more times (\(T=10\)), while BT-based models do not enjoy such a benefit. Since BT-based models split each k-ary preference into \(C_k^2\) binary preferences, they are more stable on small datasets.

4.2.2 BabyFace

Datasets We build a real-world dataset called “BabyFace”. The BabyFace dataset is based on images of a child’s facial microexpressions and includes 18 levels from happy to angry (see Fig. 6). This number of levels is attractive, since it is large enough to collect a substantial number of ordinal facial preferences, while allowing the authors to acquire a credible full preference to serve as a baseline. The BabyFace dataset consists of 816 distinct subsets, each including three different microexpressions. We submit them to Amazon Mechanical Turk and require each worker to provide her preference from happy to angry. We receive preferences from 105 workers and only consider workers who labelled at least 60 subsets, which yields a k-ary crowdsourced preference setting containing 3074 preferences labelled by 41 workers. Furthermore, we ask 7 adults independent of our data-collecting process to provide a unanimous (full) preference over the microexpressions. Although we have no access to the ground-truth preference, the unanimous preference can serve as the ground truth for three reasons: (1) the number of levels (18) is appropriate for a worker to provide a full preference; (2) the unanimous preference shared by the 7 workers is consistent with our single ground-truth assumption; and (3) the 7 workers have access to all microexpressions at the same time, and there is no time limitation on the annotation process, which ensures the credibility of the unanimous preference.

Fig. 6

Images of child’s facial microexpressions with 18 levels from anger to happy. We build the “BabyFace” dataset based on these images

Parameter initialization The parameter initialization on the real dataset is consistent with that on the simulated datasets. To be specific, for the hyperparameters \((\mu _i,\sigma _i^2)\) of score \(S_i\), we assign a standard normal prior, i.e. \(S_i\sim N(0,1)\ \forall i \in \{1,2,\ldots , 18\}\). For the hyperparameter \(\mathbf {\alpha }_w\), we initialize each \(\mathbf {\alpha }_w\) with 7 gold subsets with known true preferences. The same initialization method also applies to the hyperparameters of CrowdBT and the other baselines.

Robustness We further explore the robustness of ROPAL on the BabyFace dataset. Following the experiment on the PeerGrading dataset, we first set T to 1 and 10 to explore the effect of T on the BabyFace dataset. Then, we repeat the experiment \(10^3\) times to collect the accuracy (mean and standard deviation) of all models in Table 4.

From the accuracy of the four models on the BabyFace dataset (the second and third lines in Table 4), we observe that: (1) compared with \(T=1\), the accuracy of all algorithms except OnlineBT improves and the variances of all algorithms decrease for \(T=10\), which indicates that the results benefit from running the algorithms on the small dataset multiple times; (2) it is interesting to note that the accuracy of OnlineBT remains stable when we run the experiment multiple times. Since OnlineBT generates more preferences by splitting each k-ary preference into \(C_k^2\) binary preferences, OnlineBT is more stable on small datasets; CrowdBT, however, does not enjoy the benefit of splitting, due to the issue of cyclical inconsistency; (3) the accuracy of ROPAL is higher than that of CrowdBT and the other baselines on the BabyFace dataset with T being 1 and 10, respectively; (4) ROPAL and CrowdBT outperform OnlinePL and OnlineBT, which shows the superiority of considering worker quality. Both OnlinePL and OnlineBT assume each worker has the expertise for all tasks, so their performance is easily affected by noisy crowdsourced preferences; and (5) OnlinePL has performance comparable to CrowdBT, which is consistent with the theoretical analysis in Theorems 1 and 2: CrowdBT splits each k-ary preference into \(C_k^2\) binary preferences before aggregation, which leads to cyclical inconsistency.

Table 4 To further verify the robustness of ROPAL in the real world, we provide the accuracy (%) on the BabyFace dataset with T being 1 and 10, respectively

Noisy worker detection ROPAL introduces a denoising vector \(\mathbf {\eta }_w\) for each worker w. Specifically, each entry \(\eta _w^{r}\) (\(r = 0,1,\ldots ,3\)) denotes the probability that the corresponding permutation, whose Kendall-tau distance to preference \(O_{n,w}\) is r, happens to be the ground truth. Therefore, we can leverage \(E\big [\eta _w^0\big ] = \frac{\alpha _w^0}{ \sum _r{\alpha _w^r}}\) to denote the probability that the preferences from worker w accord with the ground truth. Likewise, CrowdBT introduces the worker quality \(\eta _w\), which also denotes the annotation accuracy of worker w, where \(E[ \eta _w] = \frac{\alpha _w}{\alpha _w+ \beta _w}\). Here, we leverage these two values as worker-quality indicators to detect noisy workers.
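A minimal sketch of this indicator, reusing the \(\mathbf {\alpha }_0\) vectors of the four simulated worker settings from Sect. 4.1 as hypothetical per-worker hyperparameters:

```python
def expected_quality(alpha_w):
    """Worker-quality indicator E[eta_w^0] = alpha_w^0 / sum_r alpha_w^r."""
    return alpha_w[0] / sum(alpha_w)

# Hypothetical workers, one per simulated setting of Sect. 4.1.
workers = {
    'expert': [24.0, 3.0, 2.0, 1.0],
    'amateur': [5.0, 4.0, 3.0, 2.0],
    'spammer': [5.0, 5.0, 4.0, 4.0],
    'malicious': [2.0, 5.0, 4.0, 2.0],
}
by_quality = sorted(workers, key=lambda w: expected_quality(workers[w]))
# ascending quality: malicious < spammer < amateur < expert
```

Sorting workers by this indicator and inspecting the lowest-ranked ones is the detection procedure applied to the BabyFace dataset below.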

Figure 7 shows two groups of workers, comprising the 6 lowest worker qualities detected by ROPAL and CrowdBT, respectively. We make the following observations. (1) According to the \(\mathbf {\eta }_w\) estimated by ROPAL, the first two entries of worker8's and worker17's denoising vectors dominate the distribution. This indicates they are amateurs, who make wrong decisions on comparable objects. Meanwhile, the third entry \(\eta _w^{2}\) of worker1, worker5 and worker12 is large, so they are most likely malicious workers, who annotate k-ary subsets casually or even maliciously. (2) Among the 6 workers with the lowest \(\eta \) estimated by CrowdBT, 5 overlap with the workers detected by ROPAL. However, these workers would still be categorized as experts by CrowdBT, since their estimated \(\eta \) are very large. Therefore, ROPAL shows superiority over CrowdBT in noisy-worker detection.

Fig. 7

To verify the effectiveness of ROPAL and CrowdBT on noisy worker detection, we provide the maps of estimated \(\mathbf {\eta }\) related to noisy workers. Left Panel: The estimated \(\mathbf {\eta }\) of 6 noisy workers detected by ROPAL. Right Panel: The estimated \(\eta \) of 6 noisy workers detected by CrowdBT. Each row represents the distribution of worker quality for each worker. The column in the left panel from left to right represents the probability of the preference from correct to inverted respectively, while the column in the right panel from left to right represents the probability that the worker provides right answers and wrong answers respectively

To further verify the efficacy of ROPAL and CrowdBT in noisy-worker detection, we conduct the following experiment. For the BabyFace dataset, we remove the preferences annotated by the 5 noisy workers detected simultaneously by both ROPAL and CrowdBT. Then, we rerun the four models on the cleaned BabyFace dataset, repeating the experiment \(10^3\) times to measure the accuracy (mean and standard deviation; bottom line in Table 4). The remaining crowdsourced preferences are the same as in the experiments on the original BabyFace dataset. Compared with the results on the original BabyFace dataset in Table 4, we make the following observations: (1) the accuracy of ROPAL stabilises around \(92\%\), which means that ROPAL is robust to the preferences provided by noisy workers; (2) the accuracy of OnlinePL and OnlineBT increases dramatically on the cleaned BabyFace dataset. This means that both ROPAL and CrowdBT detect the noisy (non-expert) workers in the dataset, and can be used to improve the quality of the original dataset by filtering the preferences annotated by noisy workers; (3) the accuracy of CrowdBT also improves significantly on the cleaned BabyFace dataset. Although CrowdBT can identify noisy (non-expert) workers, it seems not robust enough to the crowdsourced preferences from these workers; therefore, CrowdBT is less competitive than ROPAL in terms of robustness; and (4) the standard deviations of all models decrease, since the quality of the dataset improves.

5 Conclusion

In this paper, we reveal the deficiency of pairwise models for aggregating k-ary crowdsourced preferences, which motivates us to propose the RObust PlAckett–Luce (ROPAL) model. Specifically, ROPAL introduces a denoising vector to model worker quality, which corrects k-ary crowdsourced preferences with a certain probability. In addition, we design an online Bayesian inference, which makes ROPAL scalable to large-scale preferences. Comprehensive experiments on simulated and real-world datasets show that ROPAL is more robust than other well-known approaches, and that ROPAL is superior to CrowdBT in noisy-worker detection. In the future, we can extend our work in the following aspects. (1) How to choose k in practice is a meaningful direction. (2) Mixed membership models for preference aggregation (Wang et al. 2015; Zhao et al. 2016) are a promising direction; combining the mixed membership assumption with worker quality would be a new topic under the crowdsourcing setting. With these extensions, ROPAL can be applied to more complex situations.