
1 Introduction

Top-k lists are a special form of item orderings (i.e., rankings) wherein, out of n total items, only a small number of them, k, are explicitly ordered. Top-k lists have several advantages that can overcome practical drawbacks of the traditional full-list approach: a collection of items may be too large to rank or even to present in full, processing the full list could impose a massive computational/cognitive load, and it may be impossible or meaningless to compare and rank items beyond a certain point [7]. Examples of top-k lists are the top-250 movies on IMDB or the top-10 most played songs on Spotify [22].

Due to the increased use of such lists, the top-k list aggregation problem (TOP-k-AGG) has attracted considerable attention. TOP-k-AGG seeks to find a top-k list or full list that best represents the input lists. The problem arises in many different applications, including recommender systems [20], metasearch engines [12], and bioinformatics [17], and it is closely related to other problems such as top-k recommendation and top-k query processing.

TOP-k-AGG falls under the umbrella of the more general rank aggregation problem, whose objective is to combine individual rankings over a set of items into one representative collective ranking [5]. Variants of this problem have been studied probabilistically [6, 8] and deterministically [10, 12]. In the probabilistic approach, it is assumed that the observed rankings are realizations of a probabilistic model on ranking data, such as the Mallows model [16], and the goal is to recover the ground-truth ranking.

Deterministic approaches can be further categorized into score-based and distance-based methods. Approaches in the first category apply relatively simple and efficient functions to calculate the score of each item, and the aggregate ranking is obtained by sorting items based on their total scores. Score-based methods are relatively susceptible to errors and manipulation, and they may violate certain fundamental social choice properties [5]. Conversely, distance-based methods provide more robust aggregation mechanisms. The aim of these approaches is to find a consensus list that has the least cumulative disagreement with the input lists. They are typically founded on axiomatic frameworks, from which the aggregate solution is formally guaranteed to satisfy certain desirable properties [9]. However, their aggregation problems tend to be more computationally demanding and are often NP-hard [5].

Distance-based TOP-k-AGG techniques can be divided based on whether the output ranking is considered a full list or another top-k list. Dwork et al. [10], Ailon [1], and Nápoles et al. [19] fall into the first category; Fagin et al. [12] falls into the second category. The works referenced under the first category define TOP-k-AGG as finding a full list with the least cumulative distance to the input lists using the induced Kendall tau, Kendall tau, and Hausdorff distances, respectively. The method of Fagin et al. [12] provides greater flexibility, and it induces a far smaller solution space. Letting n denote the total number of items, there are \(\left( {\begin{array}{c}n\\ k\end{array}}\right) k!\) possible top-k lists using the latter approach, which is \((n - k)!\) times smaller than n! (the number of possible full strict lists over the n items).
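For a quick sanity check of this counting argument, the relationship can be verified numerically; the following minimal Python sketch uses illustrative values n = 20 and k = 5, chosen here purely for demonstration.

```python
from math import comb, factorial

n, k = 20, 5  # illustrative sizes, not taken from the paper

num_top_k_lists = comb(n, k) * factorial(k)  # C(n, k) * k! = n!/(n - k)!
num_full_lists = factorial(n)                # n!

# The ratio between the two counts equals (n - k)!, as stated above.
assert num_full_lists // num_top_k_lists == factorial(n - k)
print(num_top_k_lists, num_full_lists)
```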

There are various distance measures for comparing top-k lists, including the generalized Kendall tau, generalized Spearman's footrule, Hausdorff [12], and Goodman and Kruskal's gamma [14] distances. This paper focuses on the distance-based variant of TOP-k-AGG induced by the generalized Kendall tau distance [12]. This focus is motivated by its widespread use for comparing top-k lists and, more importantly, its flexibility in handling partial information from these lists. This distance measure has been used in this capacity for similarity search [21], search engines [18], and influence maximization [4]. Additionally, variants of this distance have been used for comparing and aggregating bucket orders [2, 11] and top-k XML lists [23]. However, to the best of our knowledge, this distance measure has not been utilized for the purpose of aggregating top-k lists since its introduction in Fagin et al. [12], possibly due to a lack of existing exact methods. To facilitate this use of the distance measure, this paper studies various exact mathematical formulations.

Contributions. Section 3 introduces a binary nonlinear programming formulation and four mixed integer linear programming (MIP) formulations of TOP-k-AGG under the generalized Kendall tau distance. Two of these formulations result from the introduction of preference cycle-prevention constraints specific to TOP-k-AGG. Section 4 compares the strengths of the MIP formulations using techniques from polyhedral theory. The mathematical formulations and polyhedral analyses presented herein can be extended to TOP-k-AGG using any other distance measure between top-k lists by modifying the objective functions accordingly.

2 Preliminaries

The rank aggregation problem was originally defined over strict rankings. Formally, a strict ranking \(\boldsymbol{\pi } \) is a bijection of \([n] = \{1, 2, \dots , n\}\) onto itself, which represents a strict order of the n items. The Kendall tau distance [15] is one of the most prominent measures of dissimilarity between rankings; it counts the number of distinct item-pairs whose relative order differs between two rankings. The Kendall tau distance between strict rankings \(\boldsymbol{\pi } ^{1}, \boldsymbol{\pi } ^{2}\) is given by \(K(\boldsymbol{\pi }^1, \boldsymbol{\pi }^2) = \sum \limits _{i, j \in [n]: i < j} K_{i, j}(\boldsymbol{\pi }^1, \boldsymbol{\pi }^2)\), where \(K_{i, j}(\boldsymbol{\pi }^1, \boldsymbol{\pi }^2)\) is set to 1 if the relative orderings of i and j are different in \(\boldsymbol{\pi }^1\) and \(\boldsymbol{\pi }^2\), and 0 otherwise. The rank aggregation problem under the Kendall tau distance is known alternatively as Kemeny Aggregation (KEMENY-AGG).
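To make this definition concrete, the following minimal Python sketch counts the discordant pairs between two strict rankings; representing each ranking as a dict mapping an item to its position is a convention chosen here purely for illustration.

```python
from itertools import combinations

def kendall_tau(pi1, pi2):
    """Number of item pairs ordered differently by two strict rankings.

    pi1, pi2: dicts mapping each item in [n] to its position (1..n).
    """
    items = list(pi1)
    return sum(
        1
        for i, j in combinations(items, 2)
        if (pi1[i] < pi1[j]) != (pi2[i] < pi2[j])
    )

# Example with n = 3: the second ranking reverses the first, so all 3 pairs disagree.
print(kendall_tau({1: 1, 2: 2, 3: 3}, {1: 3, 2: 2, 3: 1}))  # 3
```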

A top-k list \(\boldsymbol{\tau } \) is a bijection from a domain \(\boldsymbol{\mathcal {I}} _{\boldsymbol{\tau } }\) (the members of \(\boldsymbol{\tau } \)) to \([k] = \{1, \dots , k\}\), where \(k < n\). All items in \(\boldsymbol{\tau } \) are presumed to be ranked ahead of items not in \(\boldsymbol{\tau } \); however, the exact ordering of items not in the list is unknown. Let \(i \in \boldsymbol{\tau } \) indicate that item i appears in the top-k list, and let \(\boldsymbol{\tau } (i)\) denote the rank or position of i therein. Additionally, let \(i \succ _{\boldsymbol{\tau } } j\) denote that item i is ranked ahead of item j in \(\boldsymbol{\tau } \), that is, if \((i \in \boldsymbol{\tau } \wedge j \notin \boldsymbol{\tau } )\) OR \((i, j \in \boldsymbol{\tau } \wedge (\boldsymbol{\tau } (i) < \boldsymbol{\tau } (j)))\). Given top-k lists \(\boldsymbol{\tau } ^{1}\) and \(\boldsymbol{\tau } ^{2}\), let \(\boldsymbol{\varLambda }(\boldsymbol{\tau } ^{1}, \boldsymbol{\tau } ^{2})\) be the set of all unordered pairs of distinct items in \(\boldsymbol{\mathcal {I}} _{\boldsymbol{\tau } ^{1}} \bigcup \boldsymbol{\mathcal {I}} _{\boldsymbol{\tau } ^{2}}\).

Definition 1

(TOP-k-AGG). Let \(\boldsymbol{\mathcal {L}} = \{1, 2, \dots , m \}\) be the set of indices of the input top-k lists, \(\boldsymbol{\tau } ^l\) be the input top-k list \(l \in \boldsymbol{\mathcal {L}}\), \(\boldsymbol{\mathcal {I}} = \bigcup \limits _{l \in \boldsymbol{\mathcal {L}}}{\boldsymbol{\mathcal {I}} _{\boldsymbol{\tau } ^{l}}}\) be the universe of items, \(n := |\boldsymbol{\mathcal {I}} |\) be the number of items in the universe \(\boldsymbol{\mathcal {I}} \), \(\boldsymbol{\mathcal {T}}\) be the set of all possible top-k lists over \(\boldsymbol{\mathcal {I}} \), and d(., .) be a distance measure between top-k lists. TOP-k-AGG seeks to find a top-k list \(\boldsymbol{\tau } ^* \in \boldsymbol{\mathcal {T}}\) with the lowest cumulative distance to the input lists; it can be written succinctly as

$$\begin{aligned} \boldsymbol{\tau } ^* = \mathop {\textrm{argmin}}\limits \limits _{\boldsymbol{\tau } \in \boldsymbol{\mathcal {T}}}\ \sum \limits _{l \in \boldsymbol{\mathcal {L}}}d(\boldsymbol{\tau } , \boldsymbol{\tau } ^{l}). \end{aligned}$$
(1)

The rest of this paper focuses on the generalized Kendall tau distance [12]. Accordingly, the distance is restated in the following. Let p be a fixed parameter, with \(0 \le p \le 1\), and let \(K_{i, j}^{(p)}(\boldsymbol{\tau } ^{1}, \boldsymbol{\tau } ^{2})\) be the contribution to the distance function, for each item-pair \((i, j) \in \boldsymbol{\varLambda }(\boldsymbol{\tau } ^{1}, \boldsymbol{\tau } ^{2})\). The generalized Kendall tau distance with penalty parameter p, denoted by \(K^{(p)}\), is defined as

$$\begin{aligned} K^{(p)}(\boldsymbol{\tau } ^{1}, \boldsymbol{\tau } ^{2}) = \sum \limits _{(i, j) \in \boldsymbol{\varLambda }(\boldsymbol{\tau } ^{1}, \boldsymbol{\tau } ^{2})} K_{i, j}^{(p)}(\boldsymbol{\tau } ^{1}, \boldsymbol{\tau } ^{2}), \end{aligned}$$
(2)

where

$$\begin{aligned} K_{i, j}^{(p)}(\boldsymbol{\tau } ^{1}, \boldsymbol{\tau } ^{2}) = {\left\{ \begin{array}{ll} 1 &{} (i \succ _{\boldsymbol{\tau } ^{1}} j \wedge j \succ _{\boldsymbol{\tau } ^{2}} i) \vee (j \succ _{\boldsymbol{\tau } ^{1}} i \wedge i \succ _{\boldsymbol{\tau } ^{2}} j)\\ p &{} (i, j \in \boldsymbol{\tau } ^1\wedge i, j \notin \boldsymbol{\tau } ^2) \vee (i, j \notin \boldsymbol{\tau } ^1\wedge i, j \in \boldsymbol{\tau } ^2) \\ 0 &{} \text {otherwise}.\\ \end{array}\right. } \end{aligned}$$

\(K^{(p)}\) is a near metric since it satisfies a relaxed version of the triangle inequality [12]. TOP-k-AGG under \(K^{(p)}\) is a combinatorial NP-hard problem [12], which includes KEMENY-AGG as a special case (when \(k = n\)).
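To illustrate the definitions above, the following Python sketch computes \(K^{(p)}\) between two top-k lists (represented here as tuples of items in rank order, a convention chosen for illustration) and solves a toy instance of Eq. (1) by brute-force enumeration of all \(\binom{n}{k}k!\) candidate lists; given the problem's NP-hardness, this enumeration is viable only for very small n and k. The value p = 0.5 used in the example is an arbitrary illustrative choice.

```python
from itertools import combinations, permutations

def k_p(tau1, tau2, p=0.5):
    """Generalized Kendall tau distance K^(p) between two top-k lists.

    tau1, tau2: tuples of items listed in rank order (position 1 first).
    """
    pos1 = {i: r for r, i in enumerate(tau1)}
    pos2 = {i: r for r, i in enumerate(tau2)}

    def ahead(pos, i, j):  # i >_tau j: i appears, and j is absent or ranked later
        return i in pos and (j not in pos or pos[i] < pos[j])

    dist = 0.0
    for i, j in combinations(set(tau1) | set(tau2), 2):
        if (ahead(pos1, i, j) and ahead(pos2, j, i)) or \
           (ahead(pos1, j, i) and ahead(pos2, i, j)):
            dist += 1  # opposite relative orders in the two lists
        elif (i in pos1 and j in pos1 and i not in pos2 and j not in pos2) or \
             (i in pos2 and j in pos2 and i not in pos1 and j not in pos1):
            dist += p  # both items in one list, neither in the other
    return dist

def brute_force_top_k_agg(input_lists, k, p=0.5):
    """Enumerate every top-k list over the item universe; return a minimizer of Eq. (1)."""
    universe = set().union(*map(set, input_lists))
    return min(
        (tau for items in combinations(sorted(universe), k) for tau in permutations(items)),
        key=lambda tau: sum(k_p(tau, tau_l, p) for tau_l in input_lists),
    )

# Toy instance: three top-2 lists over the universe {1, 2, 3, 4}.
lists = [(1, 2), (1, 3), (2, 4)]
print(brute_force_top_k_agg(lists, k=2, p=0.5))
```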

3 Integer Programming Formulations

To the best of our knowledge, no efforts have been made to derive an explicit mathematical model of TOP-k-AGG. This section presents various formulations.

First, we define the parameters required for the objective functions of the presented formulations. Let \(\mu _{il}\) be an indicator parameter that is equal to 1 if \(i \in \boldsymbol{\tau } ^l\) and 0 otherwise, for each \(l \in \boldsymbol{\mathcal {L}}\). Additionally, let \(s_{ij}\) denote the number of input lists where item i is ranked ahead of item j, which can be expressed as

$$\begin{aligned} \begin{aligned} s_{ij}&= \sum \limits _{l \in \boldsymbol{\mathcal {L}}}\mathbbm {1}_{\left[ (i, j \in \boldsymbol{\tau } ^{l} \, \wedge \, \boldsymbol{\tau } ^{l}(i) < \boldsymbol{\tau } ^{l}(j)) \, \vee \, (i \in \boldsymbol{\tau } ^{l} \, \wedge \, j \notin \boldsymbol{\tau } ^{l})\right] } \\ {}&= \sum \limits _{l \in \boldsymbol{\mathcal {L}}}\left[ \mu _{il} \, \mu _{jl} \, \mathbbm {1}_{\boldsymbol{\tau } ^{l}(i) < \boldsymbol{\tau } ^{l}(j)} + \mu _{il}(1 - \mu _{jl})\right] . \end{aligned} \end{aligned}$$
(3)

In words, \(s_{ij}\) tallies the number of input lists in which i is ranked ahead of j, that is, the number of input lists in which both items are present and i is ranked ahead of j, plus the number of input lists in which i is present but j is not.
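The parameters \(\mu _{il}\) and \(s_{ij}\) can be assembled directly from the input lists; the sketch below (plain Python, with lists again given as rank-ordered tuples; all names are illustrative) shows one possible way to do so.

```python
from collections import defaultdict

def build_parameters(input_lists):
    """Compute mu[i][l] and s[i, j] of Eq. (3) from top-k lists given as rank-ordered tuples."""
    universe = sorted(set().union(*map(set, input_lists)))
    mu = {i: [1 if i in tau else 0 for tau in input_lists] for i in universe}
    s = defaultdict(int)
    for tau in input_lists:
        pos = {i: r for r, i in enumerate(tau)}
        for i in universe:
            for j in universe:
                if i == j:
                    continue
                if i in pos and (j not in pos or pos[i] < pos[j]):
                    s[i, j] += 1  # i is ranked ahead of j in this input list
    return mu, s

mu, s = build_parameters([(1, 2), (1, 3), (2, 4)])
print(s[1, 2], s[2, 1])  # 2, 1 under these toy lists
```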

Using these parameters, the cumulative \(K^{(p)}\) distance between a given top-k list \(\boldsymbol{\tau } \in \boldsymbol{\mathcal {T}}\) and all of the input top-k lists, i.e., \( \sum \limits _{l \in \boldsymbol{\mathcal {L}}}\sum \limits _{(i, j) \in \boldsymbol{\varLambda }(\boldsymbol{\tau } , \boldsymbol{\tau } ^{l})} K_{ij}^{(p)}(\boldsymbol{\tau } , \boldsymbol{\tau } ^{l})\), can be expressed as \(\sum \limits _{(i, j) \in \boldsymbol{\varLambda }} K_{ij}^{(p)}(\boldsymbol{\tau })\), where \(\boldsymbol{\varLambda }\) is the set of all unordered pairs of distinct items in \(\boldsymbol{\mathcal {I}} \), and

$$\begin{aligned} K_{ij}^{(p)}(\boldsymbol{\tau }) = {\left\{ \begin{array}{ll} s_{ji} + p\sum \limits _{l \in \boldsymbol{\mathcal {L}}}(1 - \mu _{il})(1 - \mu _{jl}) &{} i, j \in \boldsymbol{\tau } \, \wedge \, \boldsymbol{\tau }(i) < \boldsymbol{\tau }(j) \\ s_{ji} &{} i \in \boldsymbol{\tau } \, \wedge \, j \notin \boldsymbol{\tau } \\ p\sum \limits _{l \in \boldsymbol{\mathcal {L}}}\mu _{il}\mu _{jl} &{} i \notin \boldsymbol{\tau } \, \wedge \, j \notin \boldsymbol{\tau }. \end{array}\right. } \end{aligned}$$
(4)

Equation (4) states that, whenever items i and j are both present in \(\boldsymbol{\tau } \) (the solution top-k list) and i is ranked ahead of j, the imposed \(K^{(p)}\) distance between \(\boldsymbol{\tau } \) and all of the input lists for this pair of items equals the number of input lists where j is ranked ahead of i, plus p times the number of input lists in which neither i nor j is present. Whenever i but not j is present in \(\boldsymbol{\tau } \), the imposed \(K^{(p)}\) distance equals the number of input lists where j is ranked ahead of i. Finally, whenever neither i nor j is present in \(\boldsymbol{\tau } \), the imposed \(K^{(p)}\) distance equals p times the number of input lists where i and j are simultaneously present.
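Building on the build_parameters helper sketched above, the three coefficient types appearing in Eq. (4), which also serve as the objective coefficients of the formulations below, could be precomputed as follows; this is an illustrative sketch and the variable names are our own.

```python
def objective_coefficients(input_lists, p):
    """Coefficients of Eq. (4): c_both[i, j], c_only_i[i, j], c_neither[i, j]."""
    mu, s = build_parameters(input_lists)
    universe = sorted(mu)
    m = len(input_lists)
    c_both, c_only_i, c_neither = {}, {}, {}
    for i in universe:
        for j in universe:
            if i == j:
                continue
            absent_both = sum((1 - mu[i][l]) * (1 - mu[j][l]) for l in range(m))
            present_both = sum(mu[i][l] * mu[j][l] for l in range(m))
            c_both[i, j] = s[j, i] + p * absent_both  # i, j in tau with i ranked ahead of j
            c_only_i[i, j] = s[j, i]                  # i in tau, j not in tau
            c_neither[i, j] = p * present_both        # neither i nor j in tau
    return c_both, c_only_i, c_neither
```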

The first formulation is an MIP possessing an assignment problem-like structure, with which exactly k items are assigned to the k available positions of the solution top-k list. Its decision variables are as follows:

$$\begin{aligned}&u_{it} = {\left\{ \begin{array}{ll} 1 &{} \text {if}\, i\, \text {is assigned to position}\, t \in [k] \\ 0 &{} \text {otherwise}; \end{array}\right. }\\&w_{ij} = {\left\{ \begin{array}{ll} 1 &{} \text {if}\, i\, \text {and}\, j \,\text {are in the top-}k\, \text {list, and}\, i\,\text {is ranked ahead of}\, j\\ 0 &{} \text {otherwise}; \end{array}\right. }\\&w'_{ij} = {\left\{ \begin{array}{ll} 1 &{} \text {if}\, i \,\text {is in the top-}k\,\text {list, but not}\, j\\ 0 &{} \text {otherwise}; \end{array}\right. }\\&w''_{ij} = {\left\{ \begin{array}{ll} 1 &{} \text {if neither}\,i\,\text {nor}\, j\,\text { is present in the top-}k\, \text {list, where}\, j > i\\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

From the definitions, item i is present in the top-k list if \(\sum _{t = 1}^{k}u_{it} = 1\), and it is absent if \(\sum _{t = 1}^{k}u_{it} = 0\). The variables \(\boldsymbol{w, w'}\), and \(\boldsymbol{w''}\) determine the relative ordering of the items; these are dependent variables, as their exact values are determined by the values of the \(\boldsymbol{u}\)-variables. The first formulation (MIP#1) is as follows.

$$\begin{aligned} {\begin{matrix} \min _{u, w, w', w''} \quad &{} \sum \limits _{i \in \boldsymbol{\mathcal {I}} }\sum \limits _{j \in \boldsymbol{\mathcal {I}} } \bigg [(s_{ji} + p\sum \limits _{l \in \boldsymbol{\mathcal {L}}}(1 - \mu _{il})(1 - \mu _{jl}))w_{ij} + s_{ji}w'_{ij}\bigg ] + \\ {} &{}p \sum _{i, j \in \boldsymbol{\mathcal {I}} , j > i}\sum \limits _{l \in \boldsymbol{\mathcal {L}}} \mu _{il}\mu _{jl}w''_{ij} \end{matrix}}\end{aligned}$$
(5a)
$$\begin{aligned} \text {s.t.} \quad&\sum _{i \in \boldsymbol{\mathcal {I}} }u_{it} = 1 \qquad \qquad \forall t \in [k] \end{aligned}$$
(5b)
$$\begin{aligned}&\sum _{t \in [k]}u_{it} \le 1 \qquad \qquad \forall i \in \boldsymbol{\mathcal {I}} \end{aligned}$$
(5c)
$$\begin{aligned}&w_{ij} \ge \sum _{t' = 1}^{t}u_{it'} + \sum _{t'' = t + 1}^{k}u_{jt''} - 1 \quad \forall i, j \in \boldsymbol{\mathcal {I}} , i \ne j; \forall t \in [k-1] \end{aligned}$$
(5d)
$$\begin{aligned}&\sum _{i, j \in \boldsymbol{\mathcal {I}} }w_{ij} \le \frac{k(k - 1)}{2}\end{aligned}$$
(5e)
$$\begin{aligned}&w'_{ij} \ge \sum _{t \in [k]}u_{it} - \sum _{t \in [k]}u_{jt} \qquad \forall i, j \in \boldsymbol{\mathcal {I}} , i \ne j \end{aligned}$$
(5f)
$$\begin{aligned}&\sum _{i, j \in \boldsymbol{\mathcal {I}} }w'_{ij} = k(n - k) \end{aligned}$$
(5g)
$$\begin{aligned}&w''_{ij} \ge 1 - \sum _{t \in [k]}u_{it} - \sum _{t \in [k]}u_{jt} \qquad \forall i, j \in \boldsymbol{\mathcal {I}} , i \ne j \end{aligned}$$
(5h)
$$\begin{aligned}&\sum _{i, j \in \boldsymbol{\mathcal {I}} , j > i}w''_{ij} = \frac{(n - k)(n - k - 1)}{2} \end{aligned}$$
(5i)
$$\begin{aligned}&u_{it} \in \{0, 1\} \qquad \qquad \quad \forall i \in \boldsymbol{\mathcal {I}} ; \forall t \in [k]\end{aligned}$$
(5j)
$$\begin{aligned}&w_{ij}, w'_{ij} \ge 0 \qquad \qquad \quad \forall i, j \in \boldsymbol{\mathcal {I}} , i \ne j\end{aligned}$$
(5k)
$$\begin{aligned}&w''_{ij} \ge 0 \qquad \qquad \quad \forall i, j \in \boldsymbol{\mathcal {I}} , j > i . \end{aligned}$$
(5l)

Objective function (5a) minimizes the cumulative \(K^{(p)}\) distance to the input lists according to Eq. (4). Constraint (5b) enforces that exactly one item must be assigned to each position of the top-k list. Constraint (5c) enforces that every item must be assigned to at most one position of the list. Constraint (5d) determines the respective values of the \(\boldsymbol{w}\)-variables. More specifically, \(w_{ij} = 1\) if i occupies one of the first t positions (\(\sum _{t' = 1}^{t}u_{it'} = 1\)) and j occupies position \(t''\), where \(t + 1 \le t'' \le k\) (\(\sum _{t'' = t + 1}^{k}u_{jt''} = 1\)); otherwise, this constraint becomes redundant. Constraints (5d) and (5e) together impose preference transitivity (i.e., prevent preference cycles); this means that if h is ranked ahead of i, and i is ranked ahead of j, then h must be ranked ahead of j as well (see Theorem 1). Constraint (5f) determines the respective values of the \(\boldsymbol{w'}\)-variables; it enforces that \(w'_{ij} = 1\) if i is present in the top-k list but not j; otherwise, this constraint becomes redundant. Constraint (5g) enforces that at most \(k(n - k)\) of the \(\boldsymbol{w'}\)-variables can take a value of 1, as there are \(k(n - k)\) distinct item-pairs where exactly one of the items appears in the list. Constraint (5h) enforces that \(w''_{ij} = 1\) if neither i nor j is present in the top-k list; otherwise, this constraint becomes redundant. Constraint (5i) enforces that at most \((n - k)(n - k - 1)/2\) of the \(\boldsymbol{w''}\)-variables can take a value of 1, as this is the number of distinct item-pairs where both items are absent from the list. Constraints (5j)–(5l) specify the domain of the variables.

Taking a closer look at the structure of the constraints, we can observe that even though variables \(\boldsymbol{w}, \boldsymbol{w'}\), and \(\boldsymbol{w''}\) are defined as binary indicators, they can be declared as non-negative continuous variables since the constraints of the model alone enforce them to take only values of 0 or 1. It is also important to remark that the reason for including constraints (5f) and (5g) is that the objective function coefficients, while non-negative, are not necessarily strictly positive. More specifically, if both i and j are present in the solution top-k list, constraint (5f) only implies that \(w'_{ij} \ge 0\); if the corresponding objective function coefficient \(s_{ji}\) is 0, then any value of \(w'_{ij}\) results in the same objective function value, which is not desirable.
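For illustration, MIP#1 can be assembled with an off-the-shelf modeling layer. The sketch below uses the open-source PuLP package with its bundled CBC solver, an implementation choice made here rather than one prescribed by the paper; it reuses the coefficient helper above and keeps the \(\boldsymbol{w}\)-type variables continuous, as discussed.

```python
import pulp

def solve_mip1(input_lists, k, p):
    """Build and solve MIP#1 (5a)-(5l); return the aggregate top-k list as a tuple."""
    c_both, c_only_i, c_neither = objective_coefficients(input_lists, p)
    I = sorted(set().union(*map(set, input_lists)))
    T = range(1, k + 1)
    pairs = [(i, j) for i in I for j in I if i != j]
    upper = [(i, j) for i, j in pairs if j > i]

    prob = pulp.LpProblem("MIP1", pulp.LpMinimize)
    u = pulp.LpVariable.dicts("u", [(i, t) for i in I for t in T], cat="Binary")
    w = pulp.LpVariable.dicts("w", pairs, lowBound=0)
    wp = pulp.LpVariable.dicts("wp", pairs, lowBound=0)
    wpp = pulp.LpVariable.dicts("wpp", upper, lowBound=0)

    prob += (pulp.lpSum(c_both[i, j] * w[i, j] + c_only_i[i, j] * wp[i, j] for i, j in pairs)
             + pulp.lpSum(c_neither[i, j] * wpp[i, j] for i, j in upper))                    # (5a)

    for t in T:
        prob += pulp.lpSum(u[i, t] for i in I) == 1                                          # (5b)
    for i in I:
        prob += pulp.lpSum(u[i, t] for t in T) <= 1                                          # (5c)
    for i, j in pairs:
        for t in range(1, k):                                                                # (5d)
            prob += w[i, j] >= (pulp.lpSum(u[i, tp] for tp in range(1, t + 1))
                                + pulp.lpSum(u[j, tpp] for tpp in range(t + 1, k + 1)) - 1)
        prob += wp[i, j] >= (pulp.lpSum(u[i, t] for t in T)
                             - pulp.lpSum(u[j, t] for t in T))                               # (5f)
    for i, j in upper:
        prob += wpp[i, j] >= 1 - pulp.lpSum(u[i, t] for t in T) - pulp.lpSum(u[j, t] for t in T)  # (5h)
    prob += pulp.lpSum(w[i, j] for i, j in pairs) <= k * (k - 1) / 2                         # (5e)
    prob += pulp.lpSum(wp[i, j] for i, j in pairs) == k * (len(I) - k)                       # (5g)
    prob += pulp.lpSum(wpp[i, j] for i, j in upper) == (len(I) - k) * (len(I) - k - 1) / 2   # (5i)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    rank = {t: i for i in I for t in T if u[i, t].value() > 0.5}
    return tuple(rank[t] for t in T)

print(solve_mip1([(1, 2), (1, 3), (2, 4)], k=2, p=0.5))
```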

Theorem 1

Constraints (5d)–(5e) impose preference transitivity.

Proof

Assume that items h, i, j are present in the solution top-k list with h placed in position \(t \ge 1\), i in position \(t' > t\), and j in position \(t''\), where \(k \ge t'' > t'\). Constraint (5d) enforces that \(w_{hi} = w_{hj} = w_{ij} = 1\). However, this constraint only implies that \(w_{jh} \ge -1\). In other words, the optimization model is free to also assign \(w_{jh} = 1\), creating a preference cycle, whenever doing so does not worsen the objective function value. Hence, constraint (5d) on its own does not prevent preference cycles.

However, the total number of \(\boldsymbol{w}\)-variables that must take a value of 1 is given by \((k - 1) + (k - 2) + \dots + 1 + 0 = k(k - 1)/2\)—the first-ranked item is ahead of \(k - 1\) other items in the list, the second-ranked item is ahead of \(k - 2\) items, \(\dots \), and the item at the bottom of the list is not ranked ahead of any other items on the list. For this reason, constraint (5e) allows at most \(k(k - 1)/2\) of the \(\boldsymbol{w}\)-variables to take a value of 1, forcing all other variables (including \(w_{jh}\)) to equal 0. Therefore, constraints (5d)–(5e) together impose preference transitivity on the solution top-k list returned by solving MIP#1.    \(\square \)

Since KEMENY-AGG is a special case of TOP-k-AGG, MIP#1 provides a novel formulation for that problem as well; however, it does not apply to the variant of the problem with ties (see Yoo and Escobedo [24]). It is important to mention that Cook [9] proposed a binary linear programming formulation of KEMENY-AGG using the structure of the assignment problem; however, their set of preference cycle-prevention constraints is different from constraints (5d)–(5e).

Next, we present a binary non-linear programming formulation for TOP-k-AGG. The formulation uses the \(\boldsymbol{w}\)-variables defined for MIP#1 as well as the following decision variables:

$$\begin{aligned}&z_{i} = {\left\{ \begin{array}{ll} 1 &{} \text {if}\,i\, \text {is in the top-}k\,\text {list} \\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

The formulation is given by:

$$\begin{aligned} {\begin{matrix} \min _{\boldsymbol{w, z}} \quad &{} \sum \limits _{i \in \boldsymbol{\mathcal {I}} }\sum \limits _{j \in \boldsymbol{\mathcal {I}} } \bigg [(s_{ji} + p\sum \limits _{l \in \boldsymbol{\mathcal {L}}}(1 - \mu _{il})(1 - \mu _{jl}))w_{ij} + s_{ji}z_{i}(1 - z_{j})\bigg ] + \\ {} &{} p \sum _{i, j \in \boldsymbol{\mathcal {I}} , j > i} \sum \limits _{l \in \boldsymbol{\mathcal {L}}} \mu _{il}\mu _{jl}(1 - z_{i})(1 - z_{j}) \end{matrix}}\end{aligned}$$
(6a)
$$\begin{aligned} \text {s.t.} \quad&\sum \limits _{i \in \boldsymbol{\mathcal {I}} }z_{i} = k \end{aligned}$$
(6b)
$$\begin{aligned}&w_{hi} + w_{ij} + w_{jh} \le 2 \qquad \quad \forall h, i, j \in \boldsymbol{\mathcal {I}} , i, j > h, i \ne j \end{aligned}$$
(6c)
$$\begin{aligned}&w_{ij} + w_{ji} = z_{i}z_{j} \qquad \quad \forall i, j \in \boldsymbol{\mathcal {I}} , j > i \end{aligned}$$
(6d)
$$\begin{aligned}&z_{i}, w_{ij} \in \{0, 1\} \qquad \quad \forall i, j \in \boldsymbol{\mathcal {I}} , i \ne j. \end{aligned}$$
(6e)

Objective function (6a) minimizes the cumulative \(K^{(p)}\) distance to the input lists. Constraint (6b) restricts k items to be present in the top-k list. Constraint (6c) imposes preference transitivity only whenever items h, i, j all appear in the list; otherwise it becomes redundant, with the help of constraint (6d). Constraint (6d) enforces that, when both i and j are present in the list, one must precede the other. Constraint (6e) specifies the domains of the variables. Given a feasible solution, the output top-k items are defined by the set \(\overline{\boldsymbol{\tau } } := \{i \in \boldsymbol{\mathcal {I}} | z_{i} = 1\}\), and the exact rank of item \(i \in \overline{\boldsymbol{\tau } }\) is obtained as \(\overline{\boldsymbol{\tau } }(i) := k - \sum _{j \in \overline{\boldsymbol{\tau } }}w_{ij}\).
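As a small illustration of this rank-recovery rule, the following Python sketch reconstructs the output list from a feasible \((\boldsymbol{z}, \boldsymbol{w})\) pair, given here as plain dicts; the toy values are our own.

```python
def recover_top_k_list(z, w, k):
    """Recover the top-k list from a feasible (z, w): rank(i) = k - sum_j w[i, j]."""
    members = [i for i, zi in z.items() if zi == 1]
    rank = {i: k - sum(w.get((i, j), 0) for j in members if j != i) for i in members}
    return tuple(sorted(members, key=rank.get))

# Toy solution with k = 2: item 3 is ranked ahead of item 1.
z = {1: 1, 2: 0, 3: 1, 4: 0}
w = {(3, 1): 1}
print(recover_top_k_list(z, w, k=2))  # (3, 1)
```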

The above non-linear optimization model can be linearized using a technique from Glover and Woolsey [13]. Specifically, constraint (6d) can be replaced with three linear constraints for each distinct item pair \((i, j)\): \(w_{ij} + w_{ji} \le z_{i}\), \(w_{ij} + w_{ji} \le z_{j}\), and \(w_{ij} + w_{ji} \ge z_{i} + z_{j}- 1\). Similarly, the term \(z_{i}(1 - z_{j})\) in the objective function is replaced by auxiliary continuous variable \(x'_{ij}\) and constraints \(x'_{ij} \ge z_{i}- z_{j}\) and \(x'_{ij} \ge 0\); and the term \((1 - z_{i})(1 - z_{j})\) in the objective function is replaced by auxiliary continuous variable \(x''_{ij}\) and constraints \(x''_{ij} \ge 1 - z_{i}- z_{j}\) and \(x''_{ij} \ge 0\). The latter two cases use the fact that the objective function coefficients of \(z_{i}(1 - z_{j})\) and \((1 - z_{i})(1 - z_{j})\) are non-negative, leading to a reduction in the number of constraints required by the linearization. The resulting formulation (MIP#2) is given by:

$$\begin{aligned} {\begin{matrix} \min _{\boldsymbol{w, x', x'', z}} \quad &{} \sum \limits _{i \in \boldsymbol{\mathcal {I}} }\sum \limits _{j \in \boldsymbol{\mathcal {I}} } \bigg [(s_{ji} + p\sum \limits _{l \in \boldsymbol{\mathcal {L}}}(1 - \mu _{il})(1 - \mu _{jl}))w_{ij} + s_{ji}x'_{ij} \bigg ] + \\ {} &{}p \sum _{i, j \in \boldsymbol{\mathcal {I}} , j > i} \sum \limits _{l \in \boldsymbol{\mathcal {L}}} \mu _{il}\mu _{jl}x''_{ij} \end{matrix}}\end{aligned}$$
(7a)
$$\begin{aligned} \text {s.t.} \quad&\text {(6b)}, \text {(6c)}, \text {(6e)} \end{aligned}$$
(7b)
$$\begin{aligned}&w_{ij} + w_{ji} \ge z_{i} + z_{j} - 1 \qquad \quad \forall i, j \in \boldsymbol{\mathcal {I}} , j > i \end{aligned}$$
(7c)
$$\begin{aligned}&w_{ij} + w_{ji} \le z_{i} \qquad \qquad \forall i, j \in \boldsymbol{\mathcal {I}} , i \ne j \end{aligned}$$
(7d)
$$\begin{aligned}&x'_{ij} \ge z_{i} - z_{j} \qquad \qquad \forall i, j \in \boldsymbol{\mathcal {I}} , i \ne j \end{aligned}$$
(7e)
$$\begin{aligned}&\sum _{i, j \in \boldsymbol{\mathcal {I}} }x'_{ij} = k(n - k) \end{aligned}$$
(7f)
$$\begin{aligned}&x''_{ij} \ge 1 - z_{i} - z_{j} \qquad \qquad \forall i, j \in \boldsymbol{\mathcal {I}} , j > i \end{aligned}$$
(7g)
$$\begin{aligned}&\sum _{i, j \in \boldsymbol{\mathcal {I}} , j > i}x''_{ij} = \frac{(n - k)(n - k - 1)}{2} \end{aligned}$$
(7h)
$$\begin{aligned}&x'_{ij} \ge 0 \qquad \qquad \quad \forall i, j \in \boldsymbol{\mathcal {I}} , i \ne j, \end{aligned}$$
(7i)
$$\begin{aligned}&x''_{ij} \ge 0 \qquad \qquad \quad \forall i, j \in \boldsymbol{\mathcal {I}} , j > i. \end{aligned}$$
(7j)

The rationale behind including constraints (7f) and (7h) is the same as constraints (5g) and (5i) in MIP#1.

Next, we define two variants of the preference transitivity constraints utilized in MIP#2.

Proposition 1

Constraint (6c) can be replaced by non-linear constraints

$$\begin{aligned}&w_{hi} + w_{ij} + w_{jh} \le 3 - z_{h}z_{i}z_{j} \qquad \forall i, j > h, \,\, i \ne j, \quad {\boldsymbol{or}} \end{aligned}$$
(8)
$$\begin{aligned}&w_{hi} + w_{ij} + w_{jh} \le 1 + z_{h}z_{i}z_{j} \qquad \forall i, j > h, \,\, i \ne j. \end{aligned}$$
(9)

Furthermore, these constraints can be linearized respectively as

$$\begin{aligned}&w_{hi} + w_{ij} + w_{jh} \le 3 - \frac{1}{3} (z_{h} + z_{i} + z_{j}) \qquad \forall h, i, j \in \boldsymbol{\mathcal {I}} , \,\, i, j > h, i \ne j, \end{aligned}$$
(10)
$$\begin{aligned}&w_{hi} + w_{ij} + w_{jh} \le 1 + \frac{1}{3} (z_{h} + z_{i} + z_{j}) \qquad \forall h, i, j \in \boldsymbol{\mathcal {I}} , \,\, i, j > h, i \ne j. \end{aligned}$$
(11)

Proof

The right-hand side of constraints (8)–(11) becomes 2, as desired, when items h, i, j are all in the solution top-k list, i.e., when \(z_{h} = z_{i} = z_{j} = 1\). For the remaining cases, these constraints become redundant, with the help of constraint (7d). In particular, assume i is not in the top-k list; constraint (7d) enforces that \(w_{ij} + w_{ji} \le 0\) and \(w_{ih} + w_{hi} \le 0\); hence, constraints (8)–(11) effectively reduce to \(w_{jh} \le 1\), which is redundant.    \(\square \)

Replacing constraint (6c) with constraints (10) and (11), respectively, induces two additional MIPs.

MIP#3:

$$\begin{aligned} \min _{\boldsymbol{w, x', x'', z}} \quad&{\text {(7a)}}\\ \text {s.t.} \quad&{\text {(6b)}}, {\text {(6e)}}, {\text {(7c)}}-{\text {(7g)}} \\&w_{hi} + w_{ij} + w_{jh} \le 3 - \frac{1}{3} (z_{h} + z_{i} + z_{j}) \quad \forall h, i, j \in \boldsymbol{\mathcal {I}} , i, j > h, i \ne j. \end{aligned}$$

MIP#4:

$$\begin{aligned} \min _{\boldsymbol{w, x', x'', z}} \quad&{\text {(7a)}}\\ \text {s.t.} \quad&{\text {(6b)}}, {\text {(6e)}}, {\text {(7c)}}-{\text {(7g)}} \\&w_{hi} + w_{ij} + w_{jh} \le 1 + \frac{1}{3} (z_{h} + z_{i} + z_{j}) \quad \forall h, i, j \in \boldsymbol{\mathcal {I}} , i, j > h, i \ne j. \end{aligned}$$
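Since MIPs #2, #3, and #4 differ only in their preference transitivity constraints, a single constraint-generation routine can cover all three variants. The PuLP fragment below is a hedged sketch that assumes w and z are LpVariable dictionaries keyed by ordered item-pairs and by items, respectively, as in the earlier MIP#1 sketch; the function name and variant encoding are our own.

```python
def add_transitivity(prob, w, z, items, variant):
    """Add the MIP#2 / MIP#3 / MIP#4 transitivity constraints (6c), (10), or (11)."""
    for h in items:
        for i in items:
            for j in items:
                if i <= h or j <= h or i == j:
                    continue
                lhs = w[h, i] + w[i, j] + w[j, h]
                if variant == 2:
                    prob += lhs <= 2                                          # (6c)
                elif variant == 3:
                    prob += lhs <= 3 - (z[h] + z[i] + z[j]) * (1.0 / 3.0)     # (10)
                elif variant == 4:
                    prob += lhs <= 1 + (z[h] + z[i] + z[j]) * (1.0 / 3.0)     # (11)
```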

4 Polyhedral Comparison

Next, we compare the strength of the proposed MIPs based on their linear programming (LP) relaxation models. First, we compare the strength of MIPs #2, #3, and #4. To that end, notice that these three MIPs become equivalent when \(k \le 2\) (in which case the preference transitivity relations are irrelevant) or when \(n = k\) (in which case all items appear in the solution top-k list). Afterwards, we show that each of these formulations is stronger than MIP#1. For the remainder of the paper, let \(\mathcal {P}^{1}, \mathcal {P}^{2}, \mathcal {P}^{3}, \mathcal {P}^{4}\) be the polyhedra corresponding to the LP relaxations of MIPs #1, #2, #3, #4, respectively.

Theorem 2

For any instance of TOP-k-AGG, \(\mathcal {P}^{4} \subseteq \mathcal {P}^{2} \subseteq \mathcal {P}^{3}\), and these inclusions can be strict.

Proof

Note that MIPs #2, #3, and #4 differ only in their preference transitivity constraints. First, we show that \(\mathcal {P}^{4} \subseteq \mathcal {P}^{2} \subseteq \mathcal {P}^{3}\).

Since \(0 \le z_{i} \le 1\) for all \(i \in \boldsymbol{\mathcal {I}} \), for every feasible solution in \(\mathcal {P}^{2}, \mathcal {P}^{3}, \mathcal {P}^{4}\), we have that \((z_{h} + z_{i} + z_{j})/3 \le 1\) for all \(h, i, j \in \boldsymbol{\mathcal {I}} , i, j > h, i \ne j\). Letting \((\boldsymbol{w}, \boldsymbol{x'}, \boldsymbol{x''}, \boldsymbol{z})^{(4)} \in \mathcal {P}^{4}\) be a feasible solution to MIP#4, we have that

$$\begin{aligned} w_{hi}^{(4)} + w_{ij}^{(4)} + w_{jh}^{(4)} \le 1 + \frac{1}{3} ( z_{i}^{(4)} + z_{j}^{(4)} + z_{h}^{(4)}) \le 2 \le 3 - \frac{1}{3} ( z_{i}^{(4)} + z_{j}^{(4)} + z_{h}^{(4)}). \end{aligned}$$

Therefore, all feasible solutions to MIP#4 are also feasible to MIPs #2 and #3. Using the same logic, all feasible solutions to MIP#2 are feasible to MIP#3. This gives that \(\mathcal {P}^{4} \subseteq \mathcal {P}^{2} \subseteq \mathcal {P}^{3}\).

To show that the inclusion \(\mathcal {P}^{4} \subseteq \mathcal {P}^{2}\) can be strict, consider a small instance with \(\boldsymbol{\mathcal {I}} = \{1, 2, 3, 4\}\) and \(k = 3\). Fix the solution \((\boldsymbol{w}, \boldsymbol{x'}, \boldsymbol{x''}, \boldsymbol{z})^{(2)} \in \mathcal {P}^{2}\) as

$$\begin{aligned}&x_{14}'^{(2)} = x_{24}'^{(2)} = x_{34}'^{(2)} = 0.24, \quad w_{12}^{(2)} = w_{23}^{(2)} = w_{31}^{(2)} = 0.62, \quad w_{14}^{(2)} = w_{24}^{(2)} = w_{34}^{(2)} = 0.38,\\ {}&z_{1}^{(2)} = z_{2}^{(2)} = z_{3}^{(2)} = 0.81, \quad z_{4}^{(2)} = 0.57; \end{aligned}$$

with all other variables equal to 0. By inspection, this solution satisfies all constraints of MIP#2. However, we have that

$$\begin{aligned} w_{12}^{(2)} + w_{23}^{(2)} + w_{31}^{(2)} = 1.86 \nleq 1 + \frac{0.81 + 0.81 + 0.81}{3} = 1.81. \end{aligned}$$

This indicates that this solution does not satisfy the preference transitivity constraints of MIP#4.

Next, we use a similar process to show that the inclusion \(\mathcal {P}^{2} \subseteq \mathcal {P}^{3}\) can be strict. Consider a small instance with \(\boldsymbol{\mathcal {I}} = \{1, 2, 3, 4\}\) and \(k = 3\). Fix the solution \((\boldsymbol{w}, \boldsymbol{x'}, \boldsymbol{x''}, \boldsymbol{z})^{(3)} \in \mathcal {P}^{3}\) as

$$\begin{aligned}&x_{14}'^{(3)} = x_{24}'^{(3)} = x_{34}'^{(3)} = 0.4, \quad w_{12}^{(3)} = w_{23}^{(3)} = w_{31}^{(3)} = 0.7, \quad w_{14}^{(3)} = w_{24}^{(3)} = w_{34}^{(3)} = 0.3,\\ {}&z_{1}^{(3)} = z_{2}^{(3)} = z_{3}^{(3)} = 0.85, \quad z_{4}^{(3)} = 0.45; \end{aligned}$$

with all other variables equal to 0. By inspection, this solution satisfies all constraints of MIP#3. However, we have that

$$\begin{aligned} w_{12}^{(3)} + w_{23}^{(3)} + w_{31}^{(3)} = 2.1 \nleq 2. \end{aligned}$$

This indicates that this solution does not satisfy the preference transitivity constraints of MIP#2.    \(\square \)
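The fractional points used in this proof can also be checked mechanically. The short Python sketch below verifies, for the first point, the constraints cited in the proof (namely (6b), (6c), (7c), and (7d)) and then evaluates the violated MIP#4 inequality (11) on the triple (1, 2, 3); the second point can be checked analogously.

```python
from itertools import permutations

# Fractional point used to separate P^4 from P^2 (instance with n = 4, k = 3).
z = {1: 0.81, 2: 0.81, 3: 0.81, 4: 0.57}
w = {(1, 2): 0.62, (2, 3): 0.62, (3, 1): 0.62, (1, 4): 0.38, (2, 4): 0.38, (3, 4): 0.38}
wv = lambda i, j: w.get((i, j), 0.0)
items, k = [1, 2, 3, 4], 3

checks = [abs(sum(z.values()) - k) < 1e-9]                             # (6b)
for h, i, j in permutations(items, 3):
    if i > h and j > h:
        checks.append(wv(h, i) + wv(i, j) + wv(j, h) <= 2 + 1e-9)      # (6c)
for i, j in permutations(items, 2):
    checks.append(wv(i, j) + wv(j, i) <= z[i] + 1e-9)                  # (7d)
    if j > i:
        checks.append(wv(i, j) + wv(j, i) >= z[i] + z[j] - 1 - 1e-9)   # (7c)
print(all(checks))                                                     # True

# ...but the MIP#4 transitivity constraint (11) on the triple (1, 2, 3) fails:
print(wv(1, 2) + wv(2, 3) + wv(3, 1) <= 1 + (z[1] + z[2] + z[3]) / 3)  # False
```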

Theorem 3

For any instance of TOP-k-AGG, \(\text {proj}_{\boldsymbol{w}} \mathcal {P}^{2}, \text {proj}_{\boldsymbol{w}} \mathcal {P}^{3}, \text {proj}_{\boldsymbol{w}} \mathcal {P}^{4} \subseteq \text {proj}_{\boldsymbol{w}} \mathcal {P}^{1}\), and these inclusions can be strict.

Proof

First, we prove that proj\(_{\boldsymbol{w}}\mathcal {P}^{3} \subseteq \) proj\(_{\boldsymbol{w}}\mathcal {P}^{1}\). We show that, starting from an arbitrary solution \((\boldsymbol{w}, \boldsymbol{x'}, \boldsymbol{x''}, \boldsymbol{z}) \in \mathcal {P}^{3}\), we can deduce a solution \((\boldsymbol{u}, \boldsymbol{w}, \boldsymbol{w'}, \boldsymbol{w''}) \in \mathcal {P}^{1}\). To this end, we define the following affine mappings of variables from \(\mathcal {P}^{3}\) to \(\mathcal {P}^{1}\):

$$\begin{aligned}&u_{it} = \frac{z_{i}}{k} \quad \forall i \in \boldsymbol{\mathcal {I}} , \,\,\, \forall t \in \{1, \dots , k\} \rightarrow \sum _{t = 1}^{k}u_{it} = z_{i} \quad \forall i \in \boldsymbol{\mathcal {I}} , \end{aligned}$$
(12a)
$$\begin{aligned}&w'_{ij} = x'_{ij} \quad \forall i, j \in \boldsymbol{\mathcal {I}} , \,\, i \ne j,\end{aligned}$$
(12b)
$$\begin{aligned}&w''_{ij} = x''_{ij} \quad \forall i, j \in \boldsymbol{\mathcal {I}} , \,\, j > i. \end{aligned}$$
(12c)

Mappings (12b) and (12c) guarantee that the objective function values achieved by the respective feasible points are equal. To establish that \(\text {proj}_{\boldsymbol{w}}\mathcal {P}^{3} \subseteq \text {proj}_{\boldsymbol{w}}\mathcal {P}^{1}\), it is sufficient to show that, given a feasible solution in \(\mathcal {P}^{3}\), the mapped variables are guaranteed to satisfy all constraints of MIP#1 (i.e., this point belongs to \(\mathcal {P}^{1}\)).

Consider constraint (5b). For any \(t \in \{1, \dots , k\}\), we have

$$\begin{aligned} \sum _{i \in \boldsymbol{\mathcal {I}} }u_{it} = \sum _{i \in \boldsymbol{\mathcal {I}} }\frac{z_{i}}{k} = \frac{\sum _{i \in \boldsymbol{\mathcal {I}} }z_{i}}{k} \xrightarrow {\sum _{i \in \boldsymbol{\mathcal {I}} }z_{i} = k} \sum _{i \in \boldsymbol{\mathcal {I}} }u_{it} = 1. \end{aligned}$$

Therefore, mapping (12a) provides a solution that is guaranteed to satisfy constraint (5b).

Consider constraint (5c). For every \(i \in \boldsymbol{\mathcal {I}} \), we have

$$\begin{aligned} \sum _{t = 1}^{k}u_{it} = \sum _{t = 1}^{k}\frac{z_{i}}{k} = \frac{kz_{i}}{k} = z_{i} \le 1. \end{aligned}$$

The last inequality follows from the fact that each \(z_{i}\) is at most 1. Therefore, mapping (12a) provides a solution that is guaranteed to satisfy constraint (5c).

Next, consider constraint (5d); we focus on the maximum value of the right-hand side of this constraint given mapping (12a). For any arbitrary item-pair \((i, j)\) and any \(t \in \{1, \dots , k - 1\}\) we have

$$\begin{aligned} \sum _{t' = 1}^{t}u_{it'} + \sum _{t'' = t + 1}^{k}u_{jt''} - 1&= \sum _{t' = 1}^{t}\frac{z_{i}}{k} + \sum _{t'' = t + 1}^{k}\frac{z_{j}}{k}- 1 \\ {}&= \frac{tz_{i}}{k} + \frac{(k - t)z_{j}}{k} - 1 \\ {}&\le \frac{t}{k} + \frac{k - t}{k} - 1 = \frac{k}{k}- 1 = 1 - 1 = 0. \end{aligned}$$

The above derivation shows that, under mapping (12a), the right-hand side of constraint (5d) is non-positive. Since \(w_{ij}\ge 0\), mapping (12a) provides a solution that is guaranteed to satisfy constraint (5d).

Next, consider constraint (5e). By summing over constraint (7d), we have

$$\begin{aligned}&2\sum _{i, j \in \boldsymbol{\mathcal {I}} }w_{ij} \le (k - 1)\sum _{i \in \boldsymbol{\mathcal {I}} }z_{i} = k (k - 1)\\ \rightarrow&\sum _{i, j \in \boldsymbol{\mathcal {I}} }w_{ij} \le \frac{k(k - 1)}{2}, \end{aligned}$$

which is exactly constraint (5e).

Finally, consider constraints (5f)–(5i). Mappings (12a)–(12c) imply that all feasible solutions to constraints (7e)–(7h) are feasible to constraints (5f)–(5i). Putting all pieces together, we have \(\text {proj}_{\boldsymbol{w}}\mathcal {P}^{3} \subseteq \text {proj}_{\boldsymbol{w}} \mathcal {P}^{1}\).

Note that the preference cycle-prevention constraints of MIP#3 have no counterpart in MIP#1. Therefore, we can show that the inclusion \(\text {proj}_{\boldsymbol{w}}\mathcal {P}^{3} \subseteq \text {proj}_{\boldsymbol{w}}\mathcal {P}^{1}\) can be strict by providing a solution that satisfies constraints (7c)–(7f) but violates preference cycle-prevention constraint (10); by the mapping above, such a solution corresponds to a point that satisfies all constraints of MIP#1. There are infinitely many such solutions; for example, consider a small instance with \(\boldsymbol{\mathcal {I}} = \{1, 2, 3, 4\}\) and \(k = 3\). Fix the solution \((\boldsymbol{w}, \boldsymbol{x'}, \boldsymbol{x''}, \boldsymbol{z})^{(3)}\) as

$$\begin{aligned}&x_{14}'^{(3)} = x_{24}'^{(3)} = x_{34}'^{(3)} = 0.44, \quad w_{12}^{(3)} = w_{23}^{(3)} = w_{31}^{(3)} = 0.72, \quad w_{14}^{(3)} = w_{24}^{(3)} = w_{34}^{(3)} = 0.28,\\ {}&z_{1}^{(3)} = z_{2}^{(3)} = z_{3}^{(3)} = 0.86, \quad z_{4}^{(3)} = 0.42; \end{aligned}$$

with all other variables equal to 0. By inspection, this solution satisfies constraints (7c)–(7f); however, it violates the preference transitivity constraints involved in MIP#3, as we have

$$\begin{aligned} w_{12} + w_{23} + w_{31} = 2.16 \not \le 3 - (0.86 + 0.86 + 0.86)/3 = 2.14. \end{aligned}$$

Finally, from Theorem 2, we have that \(\mathcal {P}^{4} \subseteq \mathcal {P}^{2} \subseteq \mathcal {P}^{3}\); therefore, we can conclude that \(\text {proj}_{\boldsymbol{w}} \mathcal {P}^{2}, \text {proj}_{\boldsymbol{w}} \mathcal {P}^{4} \subseteq \text {proj}_{\boldsymbol{w}} \mathcal {P}^{1}\), and these inclusions can be strict.    \(\square \)

5 Concluding Remarks

This paper studies the top-k list aggregation problem, which includes Kemeny aggregation as a special case. It presents a binary non-linear programming formulation and four mixed-integer linear programming formulations. Furthermore, it studies the strength of the four mixed-integer linear programming formulations using polyhedral analysis. Our findings show that the presented formulations can be ordered based on the strength of their LP relaxations. The strongest formulation is induced by a novel set of preference cycle-prevention constraints, introduced herein, that are tailored to the specific structure of the top-k list aggregation problem.

Future research will explore heuristic and approximation algorithms for this problem. Additionally, investigating whether lower-bounding techniques developed for Kemeny aggregation [3] can be adapted to the top-k list aggregation problem is another promising avenue of research.