Abstract
Multistage ranking models, including the popular Plackett–Luce distribution (PL), rely on the assumption that the ranking process is performed sequentially, by assigning the positions from the top to the bottom one (forward order). A recent contribution to the ranking literature relaxed this assumption with the addition of the discrete-valued reference order parameter, yielding the novel Extended Plackett–Luce model (EPL). Inference on the EPL and its generalization into a finite mixture framework was originally addressed from the frequentist perspective. In this work, we propose the Bayesian estimation of the EPL in order to address more directly and efficiently the inference on the additional discrete-valued parameter and the assessment of its estimation uncertainty, possibly uncovering potential idiosyncratic drivers in the formation of preferences. We overcome initial difficulties in employing a standard Gibbs sampling strategy to approximate the posterior distribution of the EPL by combining the data augmentation procedure and the conjugacy of the Gamma prior distribution with a tuned joint Metropolis–Hastings algorithm within Gibbs. The effectiveness and usefulness of the proposal are illustrated with applications to simulated and real datasets.
1 Introduction
Ranking data are common in those experiments aimed at exploring preferences, attitudes or, more generically, choice behavior of a given population towards a set of items or alternatives (Vitelli et al. 2018; Gormley and Murphy 2006; Yu et al. 2005; Vigneau et al. 1999). Similar evidence also emerges in the sports context, yielding an ordering of the competitors, for instance players or teams, in terms of their ability or strength, see Henery (1981), Stern (1990) and Caron and Doucet (2012).
Formally, a ranking \(\pi =(\pi (1),\dots ,\pi (K))\) of K items is a sequence where the entry \(\pi (i)\) indicates the rank attributed to the \(i{\text{-}}th\) alternative. Data can be equivalently collected in the ordering format \(\pi ^{-1}=(\pi ^{-1}(1),\dots ,\pi ^{-1}(K))\), such that the generic component \(\pi ^{-1}(j)\) denotes the item ranked in the \(j{\text{-}}th\) position. Regardless of the adopted format, ranked observations are multivariate and, more precisely, correspond to permutations of the first K integers. Indeed, in line with the standards of the ranking literature [some fundamental examples can be found in Critchlow et al. (1991), Marden (1995), Alvo and Yu (2014) and Liu et al. (2019)], the vector notation \(\pi\) (or \(\pi ^{-1}\)) has a dual interpretation. In fact, borrowing from those standards, \(\pi\) is also understood as a bijective mapping \(\pi :I\rightarrow R\) and \(\pi ^{-1}:R\rightarrow I\) as its inverse, where \(I=\{1,\dots ,K\}\) and \(R=\{1,\dots ,K\}\) denote the set of items and the set of ranks both labelled as the first K integers. This explains why the inverse function notation \(\pi ^{-1}\) is used as a symbol representing a generic ordering. Hence, we warn the reader of this twofold meaning, whose correct specific interpretation should be clear from the context.
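To make the dual notation concrete, here is a minimal Python sketch (illustrative only; the function name is ours) converting a ranking \(\pi\) into its ordering \(\pi ^{-1}\):

```python
def ranking_to_ordering(pi):
    """Invert a ranking: pi[i-1] is the rank given to item i (1-based).
    Returns the ordering pi_inv, where pi_inv[j-1] is the item ranked j-th."""
    K = len(pi)
    pi_inv = [0] * K
    for item, rank in enumerate(pi, start=1):
        pi_inv[rank - 1] = item
    return pi_inv

# K = 4: item 1 has rank 3, item 2 rank 1, item 3 rank 4, item 4 rank 2
pi = [3, 1, 4, 2]
pi_inv = ranking_to_ordering(pi)  # [2, 4, 1, 3]: item 2 is ranked first, ...
```

Since rankings and orderings are inverse permutations of each other, applying the conversion twice recovers the original vector.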
The statistical literature concerning ranked data modeling and analysis is broadly reviewed in Marden (1995) and, more recently, in Alvo and Yu (2014) and Liu et al. (2019). Several parametric distributions on the set of permutations \(\mathcal {S}_K\) have been developed and applied to real data. A popular parametric family is the Plackett–Luce model (PL), belonging to the class of the so-called stagewise ranking models. The PL was introduced by Luce (1959) and Plackett (1975) and has a long history in the ranking literature for its numerous successful applications as well as for still inspiring new research developments. The basic idea is the decomposition of the ranking process into \(K-1\) stages, concerning the attribution of each position according to the forward order, that is, the ordering of the alternatives proceeds sequentially from the most-liked to the least-liked item. The implicit forward order assumption was relaxed by Mollica and Tardella (2014) in the Extended Plackett–Luce model (EPL). This relaxation allows for a more flexible dependence structure, hence for a better and possibly more parsimonious fitting of the observed ranking data. The PL extension in Mollica and Tardella (2014) relies on the introduction of the reference order parameter \(\rho = (\rho (1), \ldots ,\rho (K))\), indicating the rank assignment order. Specifically, the generic entry \(\rho (t)\) indicates the rank attributed at the t-th stage of the ranking process. Thus, \(\rho\) is a discrete parameter given by a permutation of the first K integers and its estimation was originally considered from the frequentist perspective (Mollica and Tardella 2014). However, in that work there was no specific attention to the inferential ability to recover this extra parameter, nor to the quantification of its estimation uncertainty.
Indeed, the new EPL parameter space is of mixed-type (continuous and discrete) and, as far as the discrete component is concerned, there was little guidance on how to move efficiently towards the global optimal solution, so only a multiple starting point approach was conceived. This resulted in a substantial computational burden with increasing K, which makes a bootstrap approach to investigating parameter estimation uncertainty less feasible.
In this work, we investigate the estimation of the EPL within the Bayesian domain and detail an original MCMC method to perform approximate posterior inference. In particular, we show how the estimation of the discrete reference order parameter can be effectively addressed by combining the Gibbs sampling (GS) for some components of the parameter space with a Metropolis–Hastings (MH) step based on a joint proposal distribution.
The outline of the article is the following. After a review of the main features of the EPL and the related data augmentation approach with latent variables, the novel Bayesian EPL is introduced in Sect. 2. The detailed description of the MCMC algorithm to perform approximate posterior inference is presented in Sect. 3, whereas illustrative applications to both simulated and real ranking data follow in Sect. 4. Final remarks and hints for future research are discussed in Sect. 5.
2 The Bayesian extended Plackett–Luce model
2.1 Model specification
As described in Mollica and Tardella (2014), the probability of a generic ordering under the EPL can be written as
$$\begin{aligned} {{\,\mathrm{\mathbf {P}}\,}}(\pi ^{-1}|\rho ,\underline{p})={{\,\mathrm{\mathbf {P}}\,}}_{\text {PL}}\left( \pi ^{-1}\circ \rho \,|\,\underline{p}\right) =\prod _{t=1}^{K}\frac{p_{\pi ^{-1}(\rho (t))}}{\sum _{v=t}^{K}p_{\pi ^{-1}(\rho (v))}}, \end{aligned}$$
(1)
where the symbol \(\circ\) indicates the composition of permutations. Hereinafter, we will shortly refer to the probability distribution in (1) as \(\text {EPL}(\rho ,\underline{p})\). The positive quantities \(\underline{p}=(p_1,\dots ,p_K)\), known as support parameters, are proportional to the probabilities for each item to be selected at the first stage of the ranking process and, hence, to be ranked in the position indicated by the first entry of \(\rho\). Expression (1) reveals that, under the EPL assumption, the probability of a generic ordering is equal to the product of the conditional probability masses of the items selected at each stage, which are obtained by normalizing the corresponding support parameters with respect to the alternatives still available in the choice set. This explains the analogy of the EPL ranking construction with the sampling without replacement of the alternatives in order of preference. The standard PL is a special instance of the EPL with \(\rho =\rho _\text {F}=(1,2,\dots ,K)\), i.e., the identity permutation also named forward order; the backward PL is the special case with \(\rho =\rho _\text {B}=(K,K-1,\dots ,1)=(K+1)-\rho _\text {F}\), i.e., the backward order. A simple example of EPL distribution is described in the “Appendix”.
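The sequential "sampling without replacement" construction described above can be sketched in Python (an illustration of the stagewise product, not the authors' code; `epl_prob`, with 1-based items and ranks, is our naming):

```python
def epl_prob(ordering, rho, p):
    """Probability of an ordering under EPL(rho, p): at stage t the item placed
    in position rho[t] is chosen with probability proportional to its support
    among the items still available (sampling without replacement)."""
    K = len(p)
    prob = 1.0
    for t in range(K):
        # items occupying the positions assigned from stage t onwards
        available = [ordering[rho[v] - 1] for v in range(t, K)]
        chosen = ordering[rho[t] - 1]
        prob *= p[chosen - 1] / sum(p[i - 1] for i in available)
    return prob
```

With the forward order \(\rho =(1,\dots ,K)\) this reduces to the standard PL; for any \(\rho\) the probabilities sum to one over \(\mathcal {S}_K\).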
As in Mollica and Tardella (2017), the data augmentation with the latent quantitative variables \(\underline{y}=(y_{st})\) for \(s=1,\dots ,N\) and \(t=1,\dots ,K\) crucially contributes to making the Bayesian inference for the EPL analytically more tractable. Let \(\underline{\pi }^{-1}=\{\pi _s^{-1}\}_{s=1}^N\) be the observed sample of N orderings. The complete-data model can be specified as follows
$$\begin{aligned} \pi _s^{-1}|\rho ,\underline{p}&\sim \text {EPL}(\rho ,\underline{p}),\\ y_{st}|\pi _s^{-1},\rho ,\underline{p}&\sim \text {Exp}\left( \textstyle \sum _{v=t}^{K}p_{\pi _s^{-1}(\rho (v))}\right) , \end{aligned}$$
where the auxiliary variables \(y_{st}\)’s are assumed to be conditionally independent of each other and exponentially distributed with rate parameter equal to the normalization term of the EPL. The complete-data likelihood turns out to be
$$\begin{aligned} L_c(\rho ,\underline{p},\underline{y})=\prod _{s=1}^{N}\prod _{t=1}^{K}p_{\pi _s^{-1}(\rho (t))}\,e^{-y_{st}\sum _{i=1}^{K}\delta _{sti}p_i}, \end{aligned}$$
(2)
where
$$\begin{aligned} \delta _{sti}=I_{\left[ i\notin \{\pi _s^{-1}(\rho (1)),\dots ,\pi _s^{-1}(\rho (t-1))\}\right] }, \end{aligned}$$
with \(\delta _{s1i}=1\) for all \(s=1,\dots ,N\) and \(i=1,\dots ,K\).
In order to set up all the ingredients needed for a fully Bayesian analysis, we must specify a prior distribution on the unknown parameters \(\rho\) and \(\underline{p}\). In the absence of a genuinely subjective prior information, we conveniently adopt prior independence and the following marginal distributions for \(\rho\) and \(\underline{p}\)
$$\begin{aligned} \rho \sim \text {Uniform}({\mathcal {S}}_K),\qquad p_i{\mathop {\sim }\limits ^{\text {iid}}}\text {Gamma}(c,d),\quad i=1,\dots ,K. \end{aligned}$$
The adoption of independent Gamma densities for the support parameters is motivated by the conjugacy with the model, as is apparent from the form of the likelihood (2). In our analysis, we considered the hyperparameter setting \(c=d=1\).
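Under this specification, a draw from the prior can be sketched as follows (assuming, as the text implies, a uniform prior on \(\rho\) over \(\mathcal {S}_K\) and i.i.d. Gamma support parameters; the function name is ours):

```python
import random

def prior_draw(K, c=1.0, d=1.0, rng=random):
    """One draw from the prior: rho uniform on the permutation set S_K,
    support parameters p_i i.i.d. Gamma(shape=c, rate=d)."""
    rho = list(range(1, K + 1))
    rng.shuffle(rho)
    # random.gammavariate takes (shape, scale), so pass scale = 1/d
    p = [rng.gammavariate(c, 1.0 / d) for _ in range(K)]
    return rho, p
```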
3 Bayesian estimation of the EPL via MCMC
3.1 Attempts towards posterior sampling
In this section, we describe an original MCMC algorithm to carry out Bayesian inference for the EPL. We propose a tuned joint Metropolis-within-Gibbs sampling (TJM-within-GS) as a simulation-based method to approximate the posterior distribution. Its distinguishing feature is the use of a suitably tuned MH algorithm relying on a joint proposal on the mixed-type parameter components \((\rho ,\underline{p})\), combined with two other kernels which act more specifically on the discrete component \(\rho\) and then on the continuous \((\underline{p},\underline{y})\). In fact, although partitioning the parameter vector of the augmented space by using the components of (\(\rho ,\underline{p},\underline{y}\)) makes it possible to derive standard full-conditionals for a GS, this ends up being a difficult-to-implement and unsuccessful strategy in practice. On the one hand, the discrete full-conditional for \(\rho\) involves a support with a rapidly-increasing cardinality with respect to K. On the other hand, the meaning of the support parameters is strictly related to the reference order (mainly to its first component) and this means that, when one keeps \(\underline{p}\) fixed, the full-conditional of \(\rho\) is very unlikely to reach high posterior density regions.
Indeed, if one aims at implementing an MCMC algorithm for a distribution with a mixed-type nature, one can try to split the task by separating the update of the continuous \((\underline{p},\underline{y})\), for which the full-conditionals can be easily derived, from the update of the discrete \(\rho\). For the latter, in fact, the full conditional can be computed only in examples where K is small, and otherwise a Metropolis update can be used instead.
In order to set up a proposal distribution for the reference order parameter, we can rely on a sequential urn model, not necessarily with the same urn composition at each stage. In that case, let us denote with \(\underline{\lambda }=(\lambda _{tj})\) the conditional multinomial probabilities governing the sequential rank attribution path such that, for each stage \(t=1,\dots ,K\), one has
$$\begin{aligned} \lambda _{tj}={{\,\mathrm{\mathbf {P}}\,}}(\rho (t)=j|\rho (1),\dots ,\rho (t-1)),\qquad j\in R_t, \end{aligned}$$
where \(R_t=R{\setminus }\{\rho (1),\dots ,\rho (t-1)\}\) is the set of positions which are still available and can be assigned at stage t. Of course, at stage \(t=1\), one has \(R_1=R\); at stage \(t=K\), one has \(\lambda _{Kj}=1\) for the last available rank \(j\in R_K\). A possible proposal distribution to be employed in the MH step for sampling the reference order \(\rho\) could have the following form
$$\begin{aligned} g(\rho |\underline{\lambda })=\prod _{t=1}^{K}\prod _{j\in R_t}\lambda _{tj}^{c_{tj}}, \end{aligned}$$
where \(\underline{c}_t=(c_{tj})_{j\in R_t}\) indicates the vector with binary components defined as follows
$$\begin{aligned} c_{tj}=I_{[\rho (t)=j]},\qquad j\in R_t. \end{aligned}$$
Nevertheless, preliminary implementations of an MH step of this kind on synthetic data revealed some critical issues in achieving high posterior density regions. Some attempts are reported in Sect. 1 of the supplementary material. This suggested, for the reasons described above, that only a joint proposal distribution of the reference order and the support parameters would allow for an adequate mixing of the resulting Markov chain.
This argument inspired a working MCMC algorithm which combines three different Markov kernels in the spirit of the hybrid strategies described in Section 2.4 of Tierney (1994). Of course, each Markov kernel has our target posterior as invariant distribution. The most original of these kernels relies on the MH algorithm with a joint proposal strategy for the update of \((\rho ,\underline{p})\). The other two kernels are nothing but simple GS kernels limited to specific blocks of the parameter components.
3.2 Tuned joint Metropolis–Hastings step
Since the marginal distribution of the observed items ranked in the first stage basically embodies a considerable portion of the information on the support parameters, we decided to decompose the joint proposal distribution starting from the first component of the reference order, immediately followed by all the support parameter components conditionally on \(\rho (1)\). In fact, in this way, the chances of proposing acceptable candidates are indeed increased. In order to sample candidate values for \(\rho\) and \(\underline{p}\) simultaneously, we devised a Metropolis kernel \(\mathcal {K}_{\text {TJM}}\) based on a joint proposal distribution \(g(\rho ,\underline{p})\) with a specific decomposition of the dependence structure, given by
The dependence structure in (3) shows that, after drawing the first component of \(\rho\), the proposal can exploit the sample evidence on the support parameters to guide the simulation of the remaining candidate entries of the reference order. In fact, it is expected that, if \(\rho (1)\) is the rank assigned at the first stage, the observed marginal frequencies of the items ranked in position \(\rho (1)\) can be regarded as an estimate of the support parameters \(\underline{p}\). In this way, the two simulated parameter vectors are linked to each other and the joint proposal should better mimic the target posterior density, thus yielding a better-mixing chain. Candidate values \((\tilde{\rho },\tilde{\underline{p}})\) are jointly generated according to the following scheme:
-
1.
sample the first component of \(\tilde{\rho }\) (stage \(t=1\)):
$$\begin{aligned} \underline{\tilde{c}}_1\sim {{\,\mathrm{Multinomial}\,}}(1,\underline{\tilde{\lambda }}_{1}=(\tilde{\lambda }_{1j})_{j\in R})\quad \Rightarrow \quad \tilde{\rho }(1)=\prod _{j\in R}j^{\tilde{c}_{1j}}. \end{aligned}$$In our application, we set \(\tilde{\lambda }_{1j}={{\,\mathrm{\mathbf {P}}\,}}(\tilde{\rho }(1)=j)={{\,\mathrm{\mathbf {P}}\,}}(\tilde{c}_{1j}=1)=1/K\) for all \(j\in R\);
-
2.
sample the support parameters:
$$\begin{aligned} \tilde{\underline{p}}|\tilde{\rho }(1)\sim \text {Dirich}(\alpha _0\times \underline{r}_{\tilde{\rho }(1)}), \end{aligned}$$where Dirich denotes the Dirichlet distribution, \(\alpha _0\) is a scalar tuning parameter and \(\underline{r}_{\tilde{\rho }(1)}\) is the marginal distribution of the items with rank \(\tilde{\rho }(1)\), that is, the vector collecting the relative frequencies of the times that each item \(i=1,\dots ,K\) has been ranked in position \(\tilde{\rho }(1)\). Specifically, the \(i{\text{-}}th\) entry of \(\underline{r}_{\tilde{\rho }(1)}\) is
$$\begin{aligned} r_{\tilde{\rho }(1)i}=\frac{1}{N}\sum _{s=1}^NI_{[\pi _s^{-1}(\tilde{\rho }(1))=i]} \end{aligned}$$and, in our analysis, we set \(\alpha _0=500\);
-
3.
sample iteratively the remaining entries of \(\tilde{\rho }\) (from stage \(t=2\) to stage \(t=K-1\)): in this step, we deem it convenient to split the sampling of the remaining entries of \(\rho\) sequentially. Unfortunately, it is computationally demanding to compute the conditional distribution of the reference order component at stage t given the reference order components selected up to stage \(t-1\), the support parameters and the data. Hence, in order to surrogate it, we exploit the marginal bivariate distributions of the items in the two consecutive stages \((t-1,t)\), resulting from the observed K-variate contingency table. Indeed, once the reference order component at stage \(t-1\) has been selected, we compute the observed contingency table \(\tilde{\tau }_{tj}\) having as first margin the item placed at the selected reference order component \(\tilde{\rho }(t-1)\) and as second margin the item placed at the reference order component \(\tilde{\rho }(t)=j\) which can possibly be selected at the next stage among the remaining positions in \(R_t\). The generic entry of the contingency table is
$$\begin{aligned} \tilde{\tau }_{ii'tj}=\sum _{s=1}^NI_{[\pi _s^{-1}(\tilde{\rho }(t-1))=i,\pi _s^{-1}(j)=i']}, \end{aligned}$$corresponding to the actually observed joint frequency counting how many times each item i in the previous stage is followed by any other item \(i'\) ranked \(j{\text{-}}th\) at the next stage.
We then compare these frequencies with the corresponding expected frequencies \(\tilde{E}_{ii't}\) under the EPL by using a Monte Carlo approximation
and compute the following distances
In fact, these different distances \((\tilde{d}_{tj})_{j\in R_t}\) can be intuitively ranked according to how compatible, hence likely, the attribution of the \(j{\text{-}}th\) position at stage t is in the light of the observed data. The above distances are then suitably scaled as \(\tilde{z}_{tj} = \frac{\tilde{d}_{tj}}{\sum _{j'\in R_t} \tilde{d}_{tj'}}\) to ensure unit sum. Note that the position corresponding to the smallest distance should be the one proposed with the highest probability. Moreover, we should maintain the disparities among the alternative distances so that, when the distances are very similar, the probabilities are also very similar. To achieve these goals, we simply maintain the same values \(\tilde{z}_{tj}\), reallocating them to the positions \(j\in R_t\) according to the reverse order, so that if \(j_1\) is the position corresponding to the smallest rescaled distance \(\tilde{z}_{tj_1}\), after the rearrangement it is assigned a value \(\tilde{q}_{tj_1}=\max _{j \in R_t} \tilde{z}_{tj}\). Similarly, if \(j_2\) is the position corresponding to the second smallest rescaled distance \(\tilde{z}_{tj_2}\), it is assigned a value \(\tilde{q}_{tj_2}=\max _{j \in R_t{\setminus } j_1} \tilde{z}_{tj}\) and so on. The rearranged relative distances \(\left\{ \tilde{q}_{tj}, j \in R_t \right\}\) are used as a surrogate approximation of the conditional probabilities corresponding to the target distribution. Indeed, we define the proposal probability of assigning position j at stage t as
$$\begin{aligned} \tilde{\lambda }_{tj}=\left( 1-|R_t|\,h\right) \tilde{q}_{tj}+h,\qquad j\in R_t, \end{aligned}$$
where \(h\in (0,1/K)\) is a tuning parameter introduced to avoid a too sparse multinomial distribution. As default value we set \(h=0.1\). Finally, for \(t=2,\dots ,K-1\), we sample
$$\begin{aligned} \underline{\tilde{c}}_t\sim {{\,\mathrm{Multinomial}\,}}\left( 1,\underline{\tilde{\lambda }}_{t}=(\tilde{\lambda }_{tj})_{j\in R_t}\right) \quad \Rightarrow \quad \tilde{\rho }(t)=\prod _{j\in R_t}j^{\tilde{c}_{tj}}. \end{aligned}$$
The resulting joint proposal probability of the candidate values is
$$\begin{aligned} g(\tilde{\rho },\tilde{\underline{p}})=\tilde{\lambda }_{1\tilde{\rho }(1)}\,\text {Dirich}\left( \tilde{\underline{p}};\alpha _0\times \underline{r}_{\tilde{\rho }(1)}\right) \prod _{t=2}^{K-1}\tilde{\lambda }_{t\tilde{\rho }(t)}. \end{aligned}$$
Hence, if we denote the observed-data likelihood with \(L(\rho ,\underline{p})\), the acceptance probability is equal to
$$\begin{aligned} \tilde{\alpha }=\min \left\{ 1,\frac{L(\tilde{\rho },\tilde{\underline{p}})\,\pi (\tilde{\underline{p}})\,g(\rho ^{(l)},\underline{p}^{(l)})}{L(\rho ^{(l)},\underline{p}^{(l)})\,\pi (\underline{p}^{(l)})\,g(\tilde{\rho },\tilde{\underline{p}})}\right\} \end{aligned}$$
and the MH step ends with the classical acceptance/rejection of the candidate pair
$$\begin{aligned} (\rho ',\underline{p}')={\left\{ \begin{array}{ll} (\tilde{\rho },\tilde{\underline{p}})&{}\quad \text {if }u'\le \tilde{\alpha },\\ (\rho ^{(l)},\underline{p}^{(l)})&{}\quad \text {otherwise,} \end{array}\right. } \end{aligned}$$
where \(u'\sim {{\,\mathrm{Unif}\,}}(0,1)\) and \((\rho ^{(l)},\underline{p}^{(l)})\) is the current pair.
In order to promote a better mixing, we combine the just-illustrated Metropolis kernel \(\mathcal {K}_{\text {TJM}}\) with two additional kernels, labelled \(\mathcal {K}_{\text {SM}}\) (Swap Move) and \(\mathcal {K}_{\text {GS}}\) (Gibbs Sampling). Hence, the MCMC simulation is based on the composition kernel \(\mathcal {K} = \mathcal {K}_{\text {TJM}} \circ \mathcal {K}_{\text {SM}} \circ \mathcal {K}_{\text {GS}}\). Indeed, the \(\mathcal {K}_{\text {SM}}\) kernel focuses on local moves of the discrete component \(\rho\), whereas \(\mathcal {K}_{\text {GS}}\) aims at improving the mixing of the continuous components. The kernel \(\mathcal {K}_{\text {SM}}\) is illustrated in the next section, while \(\mathcal {K}_{\text {GS}}\) is just a full GS cycle involving the \((\underline{p},\underline{y})\) components and is detailed in Sect. 3.4.
3.3 Swap move
We remind the reader that the values \((\rho ',\underline{p}')\) have to be regarded as temporary parameter drawings. We decided to accelerate the exploration of the parameter space by including an intermediate kernel \(\mathcal {K}_{\text {SM}}\) acting on the \(\rho\) component only, which attempts a possible local move with respect to the current value. We label the possibly successful update of this kernel as Swap Move (SM). In fact, the idea relies on a random swap of two adjacent components of \(\rho '\). Specifically, one first simulates
$$\begin{aligned} j\sim {{\,\mathrm{Unif}\,}}\{1,\dots ,K-1\} \end{aligned}$$
and then defines the new candidate as
$$\begin{aligned} \rho ''=(\rho '(1),\dots ,\rho '(j+1),\rho '(j),\dots ,\rho '(K)), \end{aligned}$$
i.e., the permutation obtained from \(\rho '\) by exchanging its \(j{\text{-}}th\) and \((j+1){\text{-}}th\) entries. Finally, by computing the acceptance probability as
$$\begin{aligned} \alpha ''=\min \left\{ 1,\frac{L(\rho '',\underline{p}')}{L(\rho ',\underline{p}')}\right\} , \end{aligned}$$
the sampled value of the reference order at the \((l+1)\)th iteration turns out to be
$$\begin{aligned} \rho ^{(l+1)}={\left\{ \begin{array}{ll} \rho ''&{}\quad \text {if }u''\le \alpha '',\\ \rho '&{}\quad \text {otherwise,} \end{array}\right. } \end{aligned}$$
where \(u''\sim {{\,\mathrm{Unif}\,}}(0,1)\).
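A minimal sketch of the swap move (Python, our naming; the log-likelihood evaluation is left to the caller):

```python
import math
import random

def swap_move(rho, rng=random):
    """Propose a local move: swap two randomly chosen adjacent entries of rho."""
    j = rng.randrange(len(rho) - 1)
    cand = list(rho)
    cand[j], cand[j + 1] = cand[j + 1], cand[j]
    return cand

def accept_swap(loglik_cand, loglik_curr, rng=random):
    """Metropolis acceptance on the log scale: the uniform prior on rho and
    the symmetric adjacent-swap proposal both cancel in the ratio."""
    return rng.random() < math.exp(min(0.0, loglik_cand - loglik_curr))
```

The adjacent-swap proposal is symmetric, so the acceptance ratio reduces to the likelihood ratio, as in the formula above.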
3.4 Tuned joint Metropolis-within-Gibbs sampling
At the generic iteration \((l+1)\), the TJM-within-GS iteratively alternates the following simulation steps
$$\begin{aligned}&(\rho ',\underline{p}')\sim \mathcal {K}_{\text {TJM}}\left( (\rho ^{(l)},\underline{p}^{(l)}),\cdot \right) ,\\&\rho ^{(l+1)}\sim \mathcal {K}_{\text {SM}}\left( \rho ',\cdot \right) ,\\&y_{st}^{(l+1)}\,\big |\,\rho ^{(l+1)},\underline{p}',\underline{\pi }^{-1}\sim \text {Exp}\left( \textstyle \sum _{i=1}^{K}\delta _{sti}\,p'_i\right) ,\\&p_i^{(l+1)}\,\big |\,\rho ^{(l+1)},\underline{y}^{(l+1)},\underline{\pi }^{-1}\sim \text {Gamma}\left( c+N,\;d+\textstyle \sum _{s=1}^{N}\sum _{t=1}^{K}\delta _{sti}\,y_{st}^{(l+1)}\right) . \end{aligned}$$
In the last two lines, the full-conditional of the unobserved continuous variables y’s is derived from the construction of the complete-data model specified in Sect. 2.1, whereas the full-conditional of the support parameters is deduced from the partial conjugate structure, requiring a straightforward update of the corresponding Gamma priors.
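The two Gibbs updates can be sketched as follows (a Python illustration under the assumption that each item is selected exactly once per complete ordering, so the Gamma shape updates by N; the exact bookkeeping in the paper's derivation may differ):

```python
import random

def gibbs_update(orderings, rho, p, c=1.0, d=1.0, rng=random):
    """One K_GS cycle on (y, p): simulate the latent exponentials y_st given
    (rho, p), then draw each support parameter from its Gamma full-conditional."""
    N, K = len(orderings), len(p)
    rate_sum = [0.0] * K  # accumulates sum_{s,t} delta_sti * y_st for each item i
    for o in orderings:
        for t in range(K - 1):  # the last stage carries no randomness
            # items still available at stage t+1, i.e. not yet selected
            avail = [o[rho[v] - 1] for v in range(t, K)]
            y = rng.expovariate(sum(p[i - 1] for i in avail))
            for i in avail:
                rate_sum[i - 1] += y
    # conjugate Gamma update: shape grows by N, rate by the accumulated
    # exponential mass (gammavariate takes shape and scale)
    return [rng.gammavariate(c + N, 1.0 / (d + rate_sum[i])) for i in range(K)]
```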
4 Illustrative applications
4.1 Simulated data
In order to verify the efficacy of our MCMC strategy, as well as the ensuing inferential ability of the proposed Bayesian framework, we set up the following simulation plan: we considered a grid of simulation settings combining different numbers of items \(K \in \{5,10,20 \}\) and sample sizes \(N \in \{50,200,1000,10{,}000\}\). For each distinct pair (K, N), we replicated 100 times the simulation of datasets from the EPL model by varying the parameter configuration: for each replication \(R=1,\dots ,100\), after fixing a true reference order \(\dot{\rho }^{(R)}\) and a true support parameter vector \(\dot{\underline{p}}^{(R)}\) by drawing \(\dot{\rho }^{(R)}\) uniformly in \({\mathcal {S}}_K\) and all the components \(\dot{p}_i^{(R)}\) i.i.d. from a uniform distribution, we simulated N i.i.d. orderings of K items from the EPL\((\dot{\rho }^{(R)},\dot{\underline{p}}^{(R)})\), forming the sample matrix \(\underline{\pi }^{-1}_{(R)}\).
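Simulating one EPL ordering amounts to the sequential sampling without replacement described in Sect. 2.1; a minimal Python sketch (our naming):

```python
import random

def sample_epl_ordering(rho, p, rng=random):
    """Draw one ordering from EPL(rho, p): at stage t an item is selected among
    those still available with probability proportional to its support, and is
    placed in the position indicated by rho[t]."""
    K = len(p)
    items = list(range(1, K + 1))
    ordering = [0] * K
    for t in range(K):
        chosen = rng.choices(items, weights=[p[i - 1] for i in items], k=1)[0]
        ordering[rho[t] - 1] = chosen
        items.remove(chosen)
    return ordering
```

Calling it N times with fixed \((\dot{\rho },\dot{\underline{p}})\) produces one replicated dataset.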
For each replication, we ran the TJM-within-GS described in Sect. 3 for a total of 10,000 iterations, of which 2000 draws were discarded as burn-in phase. As a starting point for the MCMC simulation, we randomly selected \(\rho\) and all the components \(p_i\). The resulting 8000 simulations were used to approximate the posterior distribution. Several measures were computed for assessing the inferential performance of the proposed MCMC sampler in recovering the actual parameter values \((\dot{\rho }^{(R)},\dot{\underline{p}}^{(R)})\) and the true EPL modal ranking \({\dot{\sigma }}^{(R)}\) of the K items. The ability of the approximate marginal posterior distributions to infer the known EPL parameters and modal sequences was quantified with five criteria and by averaging the results over the 100 replicated datasets \(\underline{\pi }^{-1}_{(R)}\):
-
1.
the percentage of matches between the true reference order \(\dot{\rho }^{(R)}\) and the most frequently simulated reference order \(\hat{\rho }^{(R)}\)
$$\begin{aligned} \%\text {rec}_{\hat{\rho }}=\sum _{R=1}^{100} I_{[\dot{\rho }^{(R)}=\hat{\rho }^{(R)}]}; \end{aligned}$$ -
2.
average posterior probability on the matching estimate \(\hat{\rho }^{(R)}=\dot{\rho }^{(R)}\)
$$\begin{aligned} \bar{\mathbf {P}} ( \rho = \hat{\rho } | \underline{\pi }^{-1})=\frac{1}{\% \text {rec}_{\hat{\rho }}} \sum _{R=1}^{100} I_{[\dot{\rho }^{(R)}=\hat{\rho }^{(R)}]}\mathbf {P}(\rho =\hat{\rho }^{(R)} |\underline{\pi }^{-1}_{(R)}); \end{aligned}$$ -
3.
average relative Kendall distance between \(\dot{\rho }^{(R)}\) and \(\hat{\rho }^{(R)}\)
$$\begin{aligned} \bar{d}_{\text {K}}(\dot{\rho },\hat{\rho }) = \frac{1}{100} \sum _{R=1}^{100} d_{\text {K}}(\dot{\rho }^{(R)},\hat{\rho }^{(R)}); \end{aligned}$$ -
4.
average scaled total variation distance between \(\dot{\underline{p}}^{(R)}\) and \({\hat{\underline{p}}}^{(R)}=\text {E}[\underline{p}|\rho =\hat{\rho }^{(R)},\underline{\pi }^{-1}_{(R)}]\), given by
$$\begin{aligned} \bar{d}_{\text {TV}}(\dot{\underline{p}},\hat{\underline{p}}) = \frac{1}{\% \text {rec}_{\hat{\rho }}} \sum _{R=1}^{100} I_{[\dot{\rho }^{(R)}=\hat{\rho }^{(R)}]} d_{\text {TV}}({\dot{\underline{p}}}^{(R)},{\hat{\underline{p}}}^{(R)}) \end{aligned}$$where \(d_{\text {TV}}({\dot{\underline{p}}}^{(R)},{\hat{\underline{p}}}^{(R)})=\frac{1}{2}\sum _{i=1}^K|{\dot{p}_i}^{(R)}-{\hat{p}_i}^{(R)}|\).
-
5.
average relative Kendall distance between \(\dot{\sigma }^{(R)}\) and \(\hat{\sigma }^{(R)}\)
$$\begin{aligned} \bar{d}_{\text {K}}(\dot{\sigma },\hat{\sigma }) = \frac{1}{100} \sum _{R=1}^{100} d_{\text {K}}(\dot{\sigma }^{(R)},\hat{\sigma }^{(R)}), \end{aligned}$$where \(\dot{\sigma }^{(R)}\) is the true EPL modal ranking and \(\hat{\sigma }^{(R)}\) is the maximum a posteriori (MAP) estimate.
The first three criteria concern the estimation of \(\rho\), whereas the last two measures are used to assess the inferential performance, respectively, for \(\underline{p}\) and \(\sigma\). Note that, since the range of the Kendall distance depends on K, we rescaled it by \(K(K-1)/2\) in order to have a relative index ranging over [0, 1] regardless of K.
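The rescaled Kendall distance used above can be sketched as (Python, our naming):

```python
def relative_kendall(a, b):
    """Kendall distance between two permutations a and b (number of item pairs
    ordered discordantly), rescaled by K(K-1)/2 to range over [0, 1]."""
    K = len(a)
    pos_b = {item: r for r, item in enumerate(b)}  # position of each item under b
    discordant = sum(
        1
        for i in range(K)
        for j in range(i + 1, K)
        if pos_b[a[i]] > pos_b[a[j]]
    )
    return discordant / (K * (K - 1) / 2)
```

Identical permutations give 0, complete reversals give 1, regardless of K.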
Additionally, we verified whether the choice of the starting point of the MCMC simulation, especially the starting reference order, could make a difference in terms of convergence speed and could impact the inferential ability to recover the true \(\dot{\rho }\). So, for all the same simulated datasets \(\underline{\pi }^{-1}_{(R)}\), we re-ran our MCMC sampling starting from the known true \(\dot{\rho }^{(R)}\) instead of a random starting reference order. Indeed, on average, we observed no noteworthy change in inferential ability, as testified by the values of the performance criteria reported in Table 2 of the supplementary material. This can be interpreted as a sign of an MCMC algorithm with a suitable mixing behaviour that is not strongly dependent on the starting point.
Regarding the estimation of \(\rho\), for each replication R, we focussed on the approximate marginal posterior distribution of the reference order, denoted as \(\mathbf {P}(\rho |\underline{\pi }^{-1}_{(R)})\), and used the corresponding posterior mode \(\hat{\rho }^{(R)}\) as the point MAP estimate of \(\dot{\rho }^{(R)}\). In the first column of Table 1, we can appreciate how frequently the posterior mode \(\hat{\rho }^{(R)}\) matches the true reference order \(\dot{\rho }^{(R)}\) in terms of the percentage of true recoveries (%rec\(_{\hat{\rho }}\)). We note that this percentage consistently grows with N and, on average, a larger portion of posterior mass \(\bar{\mathbf {P}}( \rho = \hat{\rho } | \underline{\pi }^{-1})\) is assigned to the matching mode (second column of Table 1). Moreover, by considering all the replications, even those in which there is no match, the posterior mode ensures a consistently decreasing (with N) average relative Kendall distance between the true reference order and the estimated one (third column of Table 1). This means that the whole posterior distribution is consistently concentrating around \(\dot{\rho }^{(R)}\).
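The MAP-style point estimate \(\hat{\rho }\) and its approximate posterior mass can be extracted from the MCMC draws as follows (a Python sketch, our naming):

```python
from collections import Counter

def posterior_mode_rho(draws):
    """Point estimate of rho: the most frequently visited reference order among
    the post burn-in MCMC draws, with its approximate posterior probability."""
    counts = Counter(tuple(r) for r in draws)
    mode, n = counts.most_common(1)[0]
    return list(mode), n / len(draws)
```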
Also for the parameters \(\underline{p}\) and \(\sigma\), the inferential results improve with increasing N, although at a slower rate for larger values of K (fourth and fifth columns of Table 1). In fact, if we look at a fixed N, we observe that the overall inferential properties worsen for increasing values of K, as expected. Indeed, for larger values of K, the cardinality of the finite space \(\mathcal {S}_K\) of the discrete parameter \(\rho\) grows factorially with K and this increases the chance that a slightly more diffuse posterior distribution might fail to select the true \(\dot{\rho }\) as the MAP estimate. Moreover, the diffuseness of the posterior is also related to the likely closeness of some components of the true support parameters \(\dot{\underline{p}}^{(R)}\), which weakens the identifiability of the reference order. In fact, in the extreme case when all the support parameter components are equal, the reference order is not identifiable.
Finally, we carried out a sensitivity analysis to investigate the possible effect of the hyperparameter and tuning value setting on the performance of our MCMC algorithm. The reader can find a detailed description of this study in the supplementary material.
4.2 Application to the sport data
We applied our Bayesian EPL to the sport dataset included in the R package Rankcluster (Jacques et al. 2014), where \(N=130\) students at the University of Illinois were asked to rank \(K=7\) sports in order of preference: \(1=\) Baseball, \(2=\) Football, \(3=\) Basketball, \(4=\) Tennis, \(5=\) Cycling, \(6=\) Swimming and \(7=\) Jogging. Prior to the Bayesian EPL analysis, sample heterogeneity was preliminarily investigated by using the EPL mixture methodology presented by Mollica and Tardella (2014), which suggested the presence of two preference groups labelled as \(G_1\) and \(G_2\) with sizes \(N_1=62\) and \(N_2=68\) (best fitting 2-component EPL mixture with BIC=2131.25). We then decided to estimate the Bayesian EPL separately on the two clusters. Furthermore, our modeling proposal has been compared with two special cases thereof: the forward (standard) PL and the backward PL, respectively labelled as EPL-\(\rho _{\text {F}}\) and EPL-\(\rho _{\text {B}}\). To inspect the fitting gain obtained with the EPL parametric class, the popular Deviance Information Criterion (DIC) developed by Spiegelhalter et al. (2002) was adopted.
To fit the Bayesian EPL, we ran the TJM-within-GS for a total of 20,000 iterations and 10,000 samples were discarded as burn-in phase. The MCMC algorithm was launched with four randomly dispersed starting points to explore the mixing performance of the sampler over the parameter space. Indeed, the four chains were found to be consistent with respect to the initial values. In this regard, Fig. 1 shows the traceplots of the log-posterior values for the four MCMC chains launched with alternative starting values for the two subsamples, where it is apparent that the MCMC algorithm exhibits a good mixing behavior and a satisfactory convergence regardless of the initialization. The graphical inspection was confirmed by Geweke’s convergence diagnostics (equal to 1.29, 1.29, 0.33 and 1.68 for \(G_1\) and to 0.71, 1.54, 0.16 and 1.56 for \(G_2\)) and Gelman and Rubin’s convergence diagnostics (equal to 1.002 for \(G_1\) and to 1.004 for \(G_2\)). Convergence diagnostics for the traces of the support parameters are shown in the supplementary material.
The DIC for the inference on the first subsample \(G_1\) highlights evidence in favor of the conventional PL, corresponding to the minimum value highlighted in bold in the first column of Table 2. Indeed, this result is recovered through the estimation of the more general EPL, whose DIC is very close to that of the PL. For the second subsample \(G_2\), instead, the DIC indicates a significant improvement of the fit associated with the EPL family with respect to the two considered restrictions thereof (second column of Table 2).
Figure 2 shows the approximations of the posterior distribution on the reference order for the two subsamples of the sport dataset. The corresponding posterior modal reference orders turned out to be \(\hat{\rho }_1=(2,1,3,5,4,6,7)\) and \(\hat{\rho }_2=(5,6,4,7,3,1,2)\), i.e., a reference order very similar to the forward one and an alternative reference sequence suggesting a preference elicitation starting from the middle-bottom positions. The posterior means and standard deviations of the group-specific support parameters conditionally on the MAP estimates \(\hat{\rho }_1\) and \(\hat{\rho }_2\) are detailed in Table 3. Those obtained from the EPL-\(\rho _{\text {F}}\) and the EPL-\(\rho _{\text {B}}\) are reported in the supplementary material. The MAP estimates of the group-specific modal orderings under the EPL scenario turn out to be (\(2=\) Football, \(1=\) Baseball, \(3=\) Basketball, \(5=\) Cycling, \(4=\) Tennis, \(6=\) Swimming, \(7=\) Jogging) and (\(6=\) Swimming, \(7=\) Jogging, \(4=\) Tennis, \(3=\) Basketball, \(2=\) Football, \(1=\) Baseball, \(5=\) Cycling) indicating opposite preferences in the two subsamples towards team and individual sports. Actually, we verified that these modal sequences are very similar to those found with the clustering approach described by Marden (1995).
4.3 Application to the sushi data
For the second real data application, we considered a subsample of the popular sushi data concerning preferences on \(K=10\) types of sushi, namely \(1=\) Shrimp, \(2=\) Sea eel, \(3=\) Tuna, \(4=\) Squid, \(5=\) Sea urchin, \(6=\) Salmon roe, \(7=\) Egg, \(8=\) Fatty tuna, \(9=\) Tuna roll, \(10=\) Cucumber roll. See Kamishima (2003) for further details on the sushi dataset. Specifically, we focused on the \(N=134\) sequences of rankers who reported having lived longest in the 39th prefecture up to the age of 15.
In addition to the forward and backward PL, the Bayesian distance-based model with the Kendall distance (DB-Kend) was considered as a possible competitor of the EPL for learning preferences on the sushi types. The latter was fitted to the data with the functions of the R package BayesMallows, recently released on CRAN (Vitelli et al. 2018) and including the entire sushi dataset of 5000 complete rankings. As in the previous application, four chains of the TJM-within-GS procedure were launched from randomly dispersed starting points and the DIC was used to select the optimal model.
We checked the effective achievement of convergence of the MCMC output through the analysis of the traceplots and the computation of the relevant diagnostics. For the log-posterior values, we obtained Geweke’s z-scores equal to 1.73, 1.61, 0.87 and 0.64 and a Gelman and Rubin’s potential scale reduction factor equal to 1.001. Convergence diagnostics for the traces of the support parameters are shown in the supplementary material. We then focussed on model comparison and parameter interpretation. For the sushi data analysis, we noticed a greater variability associated with the posterior distribution of the reference order, whose probability mass turns out to be spread over a neighbourhood of the backward path (Fig. 3). This variability represents the uncertainty on the reference order estimation which is overlooked by the restricted EPL-\(\rho _{\text {B}}\). The DIC indicates a remarkable superiority of the EPL-\(\rho _{\text {B}}\) over the EPL-\(\rho _{\text {F}}\) (DIC \(=3819.65\) vs DIC \(=3885.94\)), but the more flexible EPL class, which properly accounts for the uncertainty on \(\rho\), produces a further improvement of the final fit (DIC \(=3815.07\)). The new Bayesian EPL also outperforms the DB-Kend (DIC \(=3838.95\)). The posterior means and standard deviations of the EPL support parameters are reported in Table 4. Those obtained from the EPL-\(\rho _{\text {F}}\) and the EPL-\(\rho _{\text {B}}\) are reported in the supplementary material. The MAP estimate of the modal ordering under the EPL scenario is equal to (\(5=\) Sea urchin, \(8=\) Fatty tuna, \(1=\) Shrimp, \(6=\) Salmon roe, \(3=\) Tuna, \(4=\) Squid, \(2=\) Sea eel, \(10=\) Cucumber roll, \(9=\) Tuna roll, \(7=\) Egg).
5 Conclusions
We have addressed some relevant issues in modelling and inferring choice behavior and preferences. The standard PL for complete rankings relies on the hypothesis that the probability of a particular ordering does not depend on the subset of items from which one can choose (Luce’s Axiom). Although this is a very strong assumption, the widespread use of the PL motivated its exploitation as a building block to gain more flexibility. In particular, Mollica and Tardella (2014) explored the possibility of adding flexibility through two main ideas: (1) the addition of a discrete parameter, the reference order, specifying the order in which the ranks are sequentially assigned by the individual and which should be inferred from the data; (2) the finite mixture of PL distributions enriched with the reference order parameter (EPL mixture).
In this paper, we have focussed on (1) and developed a methodology for a fully Bayesian estimation of the EPL distribution. We have devised a hybrid strategy combining appropriately tuned MH kernels within the GS, which allows for a successful exploration of the whole mixed-type parameter space. The resulting Bayesian inference turned out to be effective in recovering the true reference order and provided a suitable MCMC simulation which can be used to quantify the uncertainty on the EPL parameters. We stress that previous attempts to infer the underlying reference order were limited to maximum likelihood point estimation via the EM algorithm, which required a brute-force multiple starting point strategy to attempt to reach the global maximum. Moreover, assessing the uncertainty on the reference order parameter in that framework would require a bootstrap approach, implying a greater computational effort. In addition, as K increases, a fixed number of multiple starting points becomes a factorially decreasing (hence negligible) fraction of all the possible initializations, which hampers the possibility of eventually achieving the global optimum in applications with larger K. On the other hand, the hybrid MCMC strategy provided valid results which are less sensitive to the starting reference order. Hence, our Bayesian methodology provides both an appropriate way to assess parameter uncertainty and a computational improvement for inferring the EPL.
Simulation studies explored the sensitivity of TJM-within-GS to the prior hyperparameters as well as tuning constants. They also evaluated its efficacy in recovering the true EPL parameters generating the data. Finally, the novel parametric approach was successfully applied to two real datasets concerning the analysis of preference patterns.
Moreover, our model setup yields deeper insight into the sequential mechanism of preference formation and into whether it privileges a more or less naturally ordered assignment of the most extreme ranks. In other words, we show how a suitable statistical approach makes it possible to assess the formation of ranking preferences and to answer, through a statistical model, the following questions: “What do I start ranking first? And what do I do then?”.
Several directions can be contemplated to further extend the Bayesian EPL. First, the methodology can be generalized to accommodate partial orderings and to introduce item-specific and individual covariates that can improve the characterization of the preference behavior. Moreover, a Bayesian EPL mixture could support a more efficient identification of a parsimonious cluster structure in the sample.
References
Alvo M, Yu PL (2014) Statistical methods for ranking data. Springer, New York
Caron F, Doucet A (2012) Efficient Bayesian inference for generalized Bradley–Terry models. J Comput Graph Stat 21(1):174–196
Critchlow DE, Fligner MA, Verducci JS (1991) Probability models on rankings. J Math Psychol 35(3):294–318
Gormley IC, Murphy TB (2006) Analysis of Irish third-level college applications data. J R Stat Soc A 169(2):361–379
Henery RJ (1981) Permutation probabilities as models for horse races. J R Stat Soc B 43(1):86–91
Jacques J, Grimonprez Q, Biernacki C (2014) Rankcluster: an R package for clustering multivariate partial rankings. R J 6(1):101–110
Kamishima T (2003) Nantonac collaborative filtering: recommendation based on order responses. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 583–588
Liu Q, Crispino M, Scheel I, Vitelli V, Frigessi A (2019) Model-based learning from preference data. Annu Rev Stat Appl 6:329–354
Luce RD (1959) Individual choice behavior: a theoretical analysis. Wiley, New York
Marden JI (1995) Analyzing and modeling rank data. Monographs on statistics and applied probability, vol 64. Chapman & Hall, New York
Mollica C, Tardella L (2014) Epitope profiling via mixture modeling of ranked data. Stat Med 33(21):3738–3758
Mollica C, Tardella L (2017) Bayesian mixture of Plackett–Luce models for partially ranked data. Psychometrika 82(2):442–458
Plackett RL (1975) The analysis of permutations. J R Stat Soc C 24(2):193–202
Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A (2002) Bayesian measures of model complexity and fit. J R Stat Soc B 64(4):583–639
Stern H (1990) Models for distributions on permutations. J Am Stat Assoc 85(410):558–564
Tierney L (1994) Markov chains for exploring posterior distributions. Ann Stat 22(4):1701–1762
Vigneau E, Courcoux P, Semenou M (1999) Analysis of ranked preference data using latent class models. Food Qual Prefer 10(3):201–207
Vitelli V, Sørensen Ø, Crispino M, Frigessi A, Arjas E (2018) Probabilistic preference learning with the Mallows rank model. J Mach Learn Res 18(158):1–49
Yu PLH, Lam KF, Lo SM (2005) Factor analysis for ranked data with application to a job selection attitude survey. J R Stat Soc A 168(3):583–597
Acknowledgements
We are deeply grateful to both anonymous referees, whose comments and suggestions allowed us to improve the article. This work has been supported by Sapienza Università di Roma, Grant RP11816436B15B6B.
Electronic supplementary material
Appendix
1.1 An example of EPL model
Here is a simple example clarifying the EPL formulation introduced in (1). Without loss of generality, suppose that an \(\text {EPL}(\dot{\rho },\underline{\dot{p}})\) with parameter values \(\dot{\rho }=(4,1,3,2)\) and \(\underline{\dot{p}}=(0.4,0.3,0.2,0.1)\) represents the true data generating mechanism. Under this EPL scenario, the positions are assigned according to an alternating preference scheme: the item selected at the first stage is the least-liked alternative (\(\dot{\rho }(1)=4\)); at the second stage, the most-liked item is specified (\(\dot{\rho }(2)=1\)); the item selected at the third stage receives the third position (\(\dot{\rho }(3)=3\)) and, finally, the remaining alternative at the fourth stage is placed second in order of preference (\(\dot{\rho }(4)=2\)). The support parameters reflect a decreasing first-stage choice probability such that \({\dot{p}}_i\propto (K+1)-i\). Hence, the chance of being ranked last decreases when proceeding from alternative 1 up to alternative 4: item 1 is the most likely to be chosen at the first stage and, thus, to be ranked last, followed in order by items 2, 3 and 4.
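The data generating mechanism just described can be sketched in Python as follows; `sample_epl_ordering` is a hypothetical helper of ours, assuming 1-based permutations as in the paper’s notation.

```python
import random

def sample_epl_ordering(rho, p, rng=random):
    """Draw one ordering pi^{-1} from EPL(rho, p): at stage t an item is chosen
    with probability proportional to its support parameter and receives the
    rank position rho(t). Permutations are 1-based, as in the paper."""
    K = len(p)
    items = list(range(1, K + 1))
    weights = list(p)
    ordering = [0] * K
    for t in range(K):
        item = rng.choices(items, weights=weights)[0]
        idx = items.index(item)
        items.pop(idx)
        weights.pop(idx)
        ordering[rho[t] - 1] = item   # the stage-t choice gets rank rho(t)
    return tuple(ordering)

# The EPL of the example: item 1 should be ranked last (position 4) about 40% of the time
rng = random.Random(123)
draws = [sample_epl_ordering((4, 1, 3, 2), (0.4, 0.3, 0.2, 0.1), rng) for _ in range(2000)]
print(sum(d[3] == 1 for d in draws) / 2000)
```

Since position 4 is assigned at the first stage (\(\dot{\rho }(1)=4\)), the empirical frequency of item 1 in the last position approximates its first-stage choice probability \({\dot{p}}_1=0.4\).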
Since the rank assignment order \(\rho\) is not restricted to the identity permutation \(\rho _{\text {F}}\) as in the PL, a generic ordering \(\pi ^{-1}\) does not coincide with the sequence \(\eta ^{-1}=\pi ^{-1}\circ \rho\) listing the items selected at each stage of the ranking process. For example, under the considered \(\text {EPL}(\dot{\rho },\underline{\dot{p}})\), observing the ordering \(\pi ^{-1}=(3,1,4,2)\) corresponds to the sequential item selections indicated by the composition
\[
\eta ^{-1}=\pi ^{-1}\circ \dot{\rho }=\big (\pi ^{-1}(\dot{\rho }(1)),\pi ^{-1}(\dot{\rho }(2)),\pi ^{-1}(\dot{\rho }(3)),\pi ^{-1}(\dot{\rho }(4))\big )=(2,3,4,1),
\]
that is, one chooses item 2 at the first stage, item 3 at the second stage, item 4 at the third stage and item 1 at the last stage.
Equation (1) states that the probability mass associated to \(\pi ^{-1}=(3,1,4,2)\) under the specified \(\text {EPL}(\dot{\rho },\underline{\dot{p}})\) can be computed from the PL distribution after rearranging the components of \(\pi ^{-1}\) according to the reference order \(\dot{\rho }\):
\[
\mathbb {P}(\pi ^{-1}=(3,1,4,2))=\text {PL}(\eta ^{-1}=(2,3,4,1)\,|\,\underline{\dot{p}})=\frac{0.3}{0.4+0.3+0.2+0.1}\cdot \frac{0.2}{0.4+0.2+0.1}\cdot \frac{0.1}{0.4+0.1}\cdot \frac{0.4}{0.4}\approx 0.0171.
\]
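The same computation can be carried out programmatically by rearranging the ordering through the reference order and applying the sequential PL stage-choice probabilities. The helper below (`epl_probability`, our own illustrative name) reproduces the worked example under the stated parameter values.

```python
import itertools

def epl_probability(ordering, rho, p):
    """Probability of the ordering pi^{-1} under EPL(rho, p), obtained by
    rearranging pi^{-1} into eta^{-1} = pi^{-1} o rho and applying the
    PL stage-wise choice probabilities. Permutations are 1-based."""
    eta = [ordering[r - 1] for r in rho]   # item chosen at each stage
    prob, remaining = 1.0, set(eta)
    for item in eta:
        prob *= p[item - 1] / sum(p[i - 1] for i in remaining)
        remaining.remove(item)
    return prob

rho, p = (4, 1, 3, 2), (0.4, 0.3, 0.2, 0.1)
print(epl_probability((3, 1, 4, 2), rho, p))   # 0.3 * (0.2/0.7) * (0.1/0.5) * 1

# Sanity check: the probabilities of all K! orderings sum to one
total = sum(epl_probability(o, rho, p) for o in itertools.permutations(range(1, 5)))
print(round(total, 10))                        # -> 1.0
```

The normalization check over all \(K!=24\) orderings confirms that the stage-wise construction defines a proper probability distribution on permutations.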
Cite this article
Mollica, C., Tardella, L. Bayesian analysis of ranking data with the Extended Plackett–Luce model. Stat Methods Appl 30, 175–194 (2021). https://doi.org/10.1007/s10260-020-00519-5