1 Introduction

We investigate the effectiveness of polynomial-time data reduction for computing optimal solutions of the NP-hard Kemeny Rank Aggregation problem. Kemeny’s corresponding voting scheme can be described as follows. An election \((V,C)\) consists of a set \(V\) of \(n\) votes and a set \(C\) of \(m\) candidates. A vote or a ranking is a total order of all candidates. For instance, in case of three candidates \(a,b,c\), the order \(c>b>a\) means that candidate \(c\) is the best-liked one and candidate \(a\) is the least-liked one. For each pair of votes \(v,w\), Kendall’s tau distance (also known as “inversion distance” since it measures the number of inversions between two permutations) between \(v\) and \(w\) is defined as

$$\begin{aligned} {{\mathrm{KT-distance}}}(v,w) = \sum _{\{c,d\} \subseteq C}d_{v,w}(c,d), \end{aligned}$$

where \(d_{v,w}(c,d)\) is set to \(0\) if \(v\) and \(w\) rank \(c\) and \(d\) in the same order, and is set to \(1\), otherwise. The score of a ranking \(r\) with respect to an election \((V,C)\) is defined as \(\sum _{v \in V} {{\mathrm{KT-distance}}}(r,v)\). A ranking \(r\) with a minimum score is called a Kemeny ranking of \((V,C)\) and its score is the Kemeny score of \((V,C)\). The central problem considered in this work is as follows:
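To make these definitions concrete, here is a minimal sketch (in Python; the helper names are ours, purely illustrative): it computes the KT-distance, the score of a ranking, and, by exhaustive search over all \(m!\) rankings, a Kemeny ranking — feasible only for very small candidate sets, consistent with the problem's NP-hardness.

```python
from itertools import combinations, permutations

def kt_distance(v, w):
    """KT-distance of two votes: number of candidate pairs ranked oppositely."""
    pos_w = {c: i for i, c in enumerate(w)}
    return sum(1 for c, d in combinations(v, 2) if pos_w[c] > pos_w[d])

def score(r, votes):
    """Score of ranking r w.r.t. the election: sum of KT-distances to all votes."""
    return sum(kt_distance(r, v) for v in votes)

def kemeny_ranking(votes):
    """A Kemeny ranking, found by exhaustive search over all m! rankings."""
    return min(permutations(votes[0]), key=lambda r: score(r, votes))
```

For instance, for the three votes \(a>b>c\), \(a>b>c\), \(b>a>c\), the sketch returns the ranking \(a>b>c\) with score \(1\).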

Kemeny Rank Aggregation 

Input: An election \((V,C)\).

Task: Find a Kemeny ranking of \((V,C)\).

Its decision variant Kemeny Score asks whether there is a Kemeny ranking of \((V,C)\) with score at most some additionally given positive integer \(k\). The Kemeny Rank Aggregation problem has numerous applications, ranging from building meta-search engines for the web and detecting spam [16] via databases [17] to the construction of genetic maps in bioinformatics [22]. Kemeny rankings are also desirable in classical voting scenarios such as the determination of a president (see, for example, http://www.votefair.org) or the selection of the best-qualified candidates for job openings. This wide range of applications is due to the fact that Kemeny rankings fulfill many properties that are desirable from the social choice point of view. For example, the Kemeny rule is the unique preference function that is neutral, consistent, and satisfies the Condorcet property [40]. These three properties are defined as follows.

Neutrality: The candidate names do not influence the Kemeny ranking, that is, given a Kemeny ranking \(r\) for an election \((\{v_1,v_2,\dots ,v_n\},C)\), for every permutation \(\pi : C \rightarrow C\), \(\pi (r)\) is a Kemeny ranking for \((\{\pi (v_1), \pi (v_2), \dots , \pi (v_n)\},C)\).

Consistency: Let \((V_1,C)\) and \((V_2,C)\) be two elections with disjoint vote sets, that is, \(V_1 \cap V_2 = \emptyset \). If \(r\) is a Kemeny ranking for \((V_1,C)\) and for \((V_2,C)\), then \(r\) is also a Kemeny ranking for \((V_1 \cup V_2,C)\).

Condorcet property: If there is a candidate (Condorcet winner) who is better than every other candidate in more than half of the votes, then this candidate is also ranked first in every Kemeny ranking.

Previous work.

The first computational complexity studies of Kemeny Score go back to Bartholdi et al. [3], showing its NP-completeness. Dwork et al. [16] showed that the problem remains NP-hard even in the case of four votes—small errors in their proof have been corrected by Biedl et al. [8]. Moreover, Dwork et al. [16] demonstrated the usefulness of Kemeny Score in aggregating web search results and provided several approximation and heuristic algorithms. In earlier work, Hemaspaandra et al. [21] derived computational complexity classifications for generalized versions of Kemeny Score. Recent papers showed constant-factor approximability [1, 41] and an (impractical) PTAS [24]. Schalekamp and van Zuylen [35] provided a thorough experimental study of approximation and heuristic algorithms. Another extensive experimental study compares 104 (mostly heuristic) algorithms solving Kemeny Rank Aggregation in different situations [2]. Due to the importance of computing provably optimal solutions, there have been experimental studies in this regard as well [13, 14]: An integer linear program and a branch-and-bound approach were applied to random instances generated under a noise model. Another line of research concerns the interpretation of Kemeny rankings as maximum likelihood estimators and several generalizations thereof [19, 29, 30, 39].

From a parameterized complexity perspective [15, 20, 31], the following is known. First fixed-parameter tractability results have been shown with respect to the single parameters number of candidates, Kemeny score, maximum range of candidate positions, and average KT-distance \(d_a\) [6]. The average KT-distance

$$\begin{aligned} d_a := \sum _{v,w \in V}{{\mathrm{KT-distance}}}(v,w) / ({n (n-1)}) \end{aligned}$$

will also play a central role in this work. The range of positions of some candidate \(x\) is defined as the difference between its maximum position and its minimum position, where the position of \(x\) in a vote is defined as the number of candidates that are ranked better than \(x\). Kemeny Score remains NP-hard when the average range of candidate positions is two [6], excluding hope for fixed-parameter tractability with respect to this parameterization. Simjour [36] further introduced the parameters “Kemeny score divided by the number of votes” (which is identical to \(d_a\) up to some constant factors [36]) and “maximum KT-distance between two input votes”, likewise showing fixed-parameter tractability, and improved the running times of the fixed-parameter algorithms corresponding to the parameterizations by average KT-distance and Kemeny score. Very recently, Nishimura and Simjour [32] developed a fixed-parameter algorithm with respect to the parameter “Kemeny score divided by the number of votes” that enumerates all Kemeny rankings. Karpinski and Schudy [23] and Fernau et al. [18] devised subexponential-time fixed-parameter algorithms for the parameters Kemeny score, average KT-distance, and Kemeny score divided by the number of votes. Mahajan et al. [26] studied above-guarantee parameterization with respect to the Kemeny score. Finally, introducing the new concept of partial kernelization, it has been shown that with respect to the average KT-distance \(d_a\) one can compute in polynomial time an equivalent instance where the number of candidates is at most \(162 d_a^2 + 9d_a\) [7].
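The average KT-distance can be computed straightforwardly from its definition; the following self-contained sketch (Python, illustrative only, with our own helper names) does so by summing over all unordered vote pairs.

```python
from itertools import combinations

def average_kt_distance(votes):
    """d_a: KT-distances summed over all ordered vote pairs, divided by n(n-1)."""
    def kt(v, w):
        pos = {c: i for i, c in enumerate(w)}
        return sum(1 for c, d in combinations(v, 2) if pos[c] > pos[d])
    n = len(votes)
    # KT-distance is symmetric, so summing over unordered pairs and doubling
    # equals the sum over ordered pairs in the definition above
    return 2 * sum(kt(v, w) for v, w in combinations(votes, 2)) / (n * (n - 1))
```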

Formally, the concept of partial kernelization is defined as follows. Let \((I,k)\) be an instance of a parameterized problem \(P\), where \(I \in \Sigma ^*\) denotes the input instance and \(k\) the parameter. Let \(d: \Sigma ^* \rightarrow \mathbb {N}\) be a computable function such that \(P\) is fixed-parameter tractable with respect to \(d(I)\). Then \(P\) admits a partial kernel if there is a polynomial-time algorithm that computes an instance \((I',k')\) of \(P\) such that \((I,k)\) is a yes-instance if and only if \((I',k')\) is a yes-instance, \(k' \le g_1(k)\), and \(d(I') \le g_2(k)\) for computable functions \(g_1\) and \(g_2\). A (partial) problem kernel is called enumerative if there is a one-to-one correspondence between the solutions of the original instance and the solutions of the kernel [37].

Our contributions.

Our central focus is on mathematically analyzing and empirically evaluating polynomial-time executable data reduction algorithms for Kemeny Rank Aggregation. These rules preserve the possibility to find optimal (not only approximate) solutions on the one hand, and provide an efficient and effective preprocessing with provable performance guarantee on the other hand. We remark that a similar data reduction has been used by Cohen et al. [12] to improve the approximation quality (but not running time) of a heuristic algorithm for a closely related (more general) problem. In our setting, however, data reduction is shown to be the key to significantly improve the running time of exact solution strategies.

On the theoretical side, we improve the previous partial kernel size [7] from \(162 d_a^2 + 9d_a\) candidates to fewer than \(16 d_a / 3\) candidates. More specifically, we investigate data reduction rules exploiting majorities, ending up with an enumerative partial kernel consisting of fewer than \(16 d_a / 3\) candidates. Next, we show that in case of dropping the quest for an enumerative kernel, our previous data reduction rules are subsumed by a data reduction rule based on the well-known Extended Condorcet Criterion [38]. We note that, very recently, Simjour [37] showed a further improvement for the enumerative kernel to \(4d_a\) candidates and for the relaxed case to even \((2+\epsilon )d_a\) candidates for \(\epsilon >0\).

On the practical side, we provide empirical evidence for the usefulness of the data reduction rules associated with the above-mentioned kernelization. An essential property of the data reduction rules is that they can break instances into several subinstances to be handled independently, that is, the relative order between the candidates of two different subinstances in a Kemeny ranking is already determined. This also means that for hard instances which we could not completely solve we were still able to compute “partial rankings” of the top and bottom ranked candidates. After exhaustive application of our data reduction rules, we employ some of the known fixed-parameter algorithms [6, 11, 36] and a known integer linear program formulation [13] to solve small parts of the instances remaining after data reduction. Our experiments showed that the data reduction rules allow for significantly decreased running times; they are crucial to solving larger instances. For example, data reduction decreased the running time of the Gurobi ILP solver on a test instance from winter sports with 69 candidates from 3.9 to 0.11 s. Furthermore, an instance generated from web search with 240 candidates, which could not be solved within 70 s without data reduction, was solved in 3 s.

2 Data reduction and a linear partial problem kernel

In this section, we examine two simple properties of Kemeny rankings which directly yield data reduction rules. In Sect. 2.2, we analyze natural extensions of a majority-based property including limitations of the corresponding data reduction rules. In Sect. 2.3, we show that the data reduction from Sect. 2.2 provides a linear partial kernel for the parameter average KT-distance that preserves all Kemeny rankings. In Sect. 2.4, we analyze a known extension of the Condorcet property and show how the corresponding data reduction rules relate to the data reduction from Sect. 2.2. In Sect. 2.5, we briefly discuss the tightness of our kernelization upper bound and highlight some practical consequences of our theoretical findings. We start in Sect. 2.1 with some definitions and sketch some relevant previous results [7].

2.1 Definitions and previous results

As to the notation of rankings, we write the list of candidates separated by the \(>\)-symbol in the corresponding preference order. For example, \(a>b>c\) indicates that \(a\) is preferred to \(b\) and \(c\), and \(b\) is preferred to \(c\). When we express partial information about the preference order we use quotation marks. For example, “\(a>b\)” means that candidate \(a\) is preferred to \(b\), without giving further information about other candidates. Furthermore, let \(\#_v(a>b)\) denote the number of votes with “\(a>b\)”. A candidate pair \(\{a,b\}\) is called a conflict pair if \(\#_v(a>b)>0\) and \(\#_v(b>a)>0\).

The data reduction framework from previous work [7] introduces a “dirtiness concept” and shows that one can delete some “non-dirty candidates” by a data reduction rule leading to a partial kernel with respect to the average KT-distance. The “dirtiness” of a pair of candidates is measured by the degree of agreement of the votes for this pair. To this end, we introduce the following notation. For an election \((V,C)\), two candidates \(c,c' \in C\), and a rational number \(s \in \ ]1/2,1]\), we write

$$\begin{aligned} c \ge _{s} c', \end{aligned}$$

if at least \(\lceil s \cdot |V| \rceil \) of the votes prefer \(c\) to \(c'\), that is, \(\#_v(c>c') \ge s \cdot |V|\). A candidate pair \(\{c,c'\}\) is dirty according to the \(\ge _s\)-majority if neither \(c\ge _{s} c'\) nor \(c'\ge _{s} c\). The remaining pairs are called non-dirty according to the \(\ge _s\)-majority. This directly leads to the parameter \(n_d^s\) denoting the number of dirty pairs according to the \(\ge _s\)-majority. To simplify matters, we write “\(\ge _{3/4}\)” instead of “\(\ge _s\) with \(s = 3/4\)”. In this sense, let \(n_d^{\max }\) denote the maximum number of dirty pairs according to \(\ge _s\)-majorities with \(s \in [3/4,1]\). When the value of \(s\) is clear from the context, then we speak of “dirty pairs” and omit “according to the \(\ge _s\)-majority”. Previous work only considered \(\ge _{2/3}\)-majorities and provided a data reduction rule such that the number of candidates in a reduced instance is at most quadratic in \(n_d^{2/3}\). In this work, we provide a linear partial kernel with respect to \(n_d^s\) according to the \(\ge _s\)-majority for \(s \in [3/4,1]\) and show that this leads to a linear partial kernel with respect to \(d_a\).

We say that \(c\) and \(c'\) are ordered according to the \(\ge _s\)-majority in a ranking \(l\) if \(c>c'\) in \(l\) and \(c\ge _{s}c'\). If all candidate pairs are non-dirty with respect to the \(\ge _s\)-majority for an \(s \in \ ]2/3,1]\), then there exists a \(\ge _s\)-majority order, that is, a ranking in which all candidate pairs are ordered according to the \(\ge _s\)-majority [7]. Furthermore, the corresponding \(\ge _s\)-majority order can be found in polynomial time and is a Kemeny ranking [7]. Candidates appearing only in non-dirty pairs according to a \(\ge _s\)-majority are called non-dirty candidates and all remaining candidates are dirty candidates according to the \(\ge _s\)-majority. Note that with this definition a non-dirty pair can also be formed by two dirty candidates. See Table 1 for an overview of partial kernelization results versus polynomial-time solvability.
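The dirtiness notions above translate directly into code; the following sketch (Python; the function names are ours) classifies the candidates of an election as dirty or non-dirty for a given \(s\). For the three cyclic votes \(a>b>c\), \(b>c>a\), \(c>a>b\), every pair is dirty for \(s=3/4\) but non-dirty for \(s=2/3\).

```python
from itertools import combinations
from math import ceil

def prefers(votes, a, b):
    """#_v(a > b): number of votes ranking a before b."""
    return sum(1 for v in votes if v.index(a) < v.index(b))

def classify(votes, candidates, s):
    """Split candidates into (non-dirty, dirty) w.r.t. the >=_s-majority."""
    need = ceil(s * len(votes))  # at least ceil(s*|V|) votes must agree
    dirty_pairs = {frozenset(p) for p in combinations(candidates, 2)
                   if prefers(votes, p[0], p[1]) < need
                   and prefers(votes, p[1], p[0]) < need}
    dirty = {c for p in dirty_pairs for c in p}
    return [c for c in candidates if c not in dirty], sorted(dirty)
```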

Table 1 Partial kernelization and polynomial-time solvability

We end with some notation needed to state our data reduction rules. For a candidate subset \(C' \subseteq C\), a ranking fulfills the condition \(C'> C {\setminus } C'\) if every candidate from \(C'\) is preferred to every candidate from \(C {\setminus } C'\). A subinstance of \((V,C)\) induced by a candidate subset \(C' \subseteq C\) is given by \((V',C')\) where every vote in \(V'\) one-to-one corresponds to a vote in \(V\) keeping the relative order of the candidates from \(C'\).

2.2 Exploiting \(\ge _{3/4}\)-majorities

We generalize a well-known property of Kemeny rankings to develop data reduction rules: If there is a candidate \(a\) such that, for every other candidate, all votes agree on the relative order of \(a\) and that candidate, then in every Kemeny ranking the relative order of \(a\) and any other candidate is consistent with the votes. Formally, we have the following observation.

Observation 1

Let \(a \in C\) be a non-dirty candidate with respect to the \(\ge _{1}\)-majority and \(b \in C{\setminus } \{a\}\). If \(a \ge _{1} b\), then in every Kemeny ranking “\( a > b \)”; if \(b \ge _{1} a\), then in every Kemeny ranking “\( b > a \)”.

This observation directly implies a simple data reduction rule which may split the instance into two subinstances: one containing all votes restricted to the candidates that are preferred to \(a\) by all votes and one containing all votes restricted to the remaining candidates (except \(a\)). In the remaining part of this subsection, we show how to extend this observation with respect to \(\ge _{s}\)-majorities, that is, to cases where only a \(\ge _{s}\)-majority of the voters instead of all voters agree on the relative orderings. We investigate to which \(\ge _s\)-majorities Observation 1 extends. An overview of properties for a Kemeny ranking for different values of \(s\) is provided in Table 2.

The \(\ge _{3/4}\)-Majority Rule. It is easy to see that Observation 1 cannot hold for arbitrary \(\ge _{s}\)-majorities. For example, with \(s=1/2\) the observation would already be contradictory for the election consisting of the two votes \(a>b\) and \(b>a\). The following lemma shows that Observation 1 can be generalized for all \(\ge _{s}\)-majorities with \(s \in [3/4,1]\).

Lemma 1

Let \(a \in C\) be a non-dirty candidate with respect to the \(\ge _{s}\)-majority and \(b \in C{\setminus } \{a\}\) with \(s \in [3/4,1]\). If \(a \ge _{s} b\), then in every Kemeny ranking “\( a > b \)”; if \(b \ge _{s} a\), then in every Kemeny ranking “\( b > a \)”.

Table 2 Properties “induced” by \(\ge _s\)-majorities for different values of \(s\)

Proof

Let the partial score of a candidate subset \(C'\) be the Kemeny score of the subinstance induced by \(C'\). We consider the case \(a \ge _{3/4} b\); the case \(b \ge _{3/4} a\) and the proof for \(\ge _{s}\)-majorities with \(s > 3/4\) follow in complete analogy. The proof is by contradiction. Assume that there is a Kemeny ranking \(l\) with “\(b > D > a\)” for some \(D \subseteq C {\setminus }\{a,b\}\). Since \(a\) is non-dirty, for every candidate \(d \in D\) it must hold that either \(a \ge _{3/4} d\) or \(d \ge _{3/4} a\). Let \(D_1 := \{d \in D \mid a \ge _{3/4} d\}\) and \(D_2 := \{d \in D \mid d \ge _{3/4} a\}\). Consider the ranking \(l'\) obtained from \(l\) through replacing

$$\begin{aligned} \text {``}b >D> a\text {''}\quad \text {by}\quad \text {``}D_2 >a>b>D_1\text {''}, \end{aligned}$$

where the positions of all other candidates remain unchanged and the candidates within \(D_1\) and \(D_2\) have the same relative order as within \(D\).

We show that the score of \(l\) is greater than the score of \(l'\), contradicting that \(l\) is a Kemeny ranking. The only pairs of candidates that have different orders in \(l\) and \(l'\) and thus can contribute with different partial scores to the scores of \(l\) and \(l'\) are \(\{a,d_1\}\) and \(\{d_2,d_1\}\) for all \(d_1 \in D_1 \cup \{b\}\) and all \(d_2 \in D_2\). Consider any \(d_1 \in D_1 \cup \{b\}\) and \(d_2 \in D_2\). Since \(d_2 \ge _{3/4} a\) and \(a \ge _{3/4} d_1\), at least \(3|V|/4\) votes contain “\(d_2>a\)” and at least \(3|V|/4\) votes contain “\(a>d_1\)”; hence, the intersection of these two sets of votes must contain at least \(|V|/2\) elements, that is, there must be at least \(|V|/2\) votes with “\(d_2>a>d_1\)”. Thus, the partial score of \(\{d_2,d_1\}\) in \(l\) is at least as high as its partial score in \(l'\). The partial score of every pair \(\{a,d_1\}\) with \(d_1 \in D_1 \cup \{b\}\) in \(l'\) is strictly less than the partial score in \(l\). Since \( |D_1 \cup \{b\}| \ge 1\), the score of \(l'\) is smaller than the score of \(l\) and thus \(l\) cannot be a Kemeny ranking, a contradiction. \(\square \)

As a direct consequence of Lemma 1 we can partition the candidates of an election \((V,C)\) as follows. Let \(N:=\{n_1, \dots , n_t\}\) denote the set of non-dirty candidates with respect to the \(\ge _{3/4}\)-majority such that \(n_i \ge _{3/4} n_{i+1}\) for \(1 \le i \le t-1\). Then, let \(D_0\) be the set of dirty candidates \(d\) with \(d \ge _{3/4} n_1\), let \(D_i\) for \(1 \le i \le t-1\) be the set of dirty candidates \(d\) with \(n_i \ge _{3/4} d\) and \(d \ge _{3/4} n_{i+1}\), and let \(D_t\) be the set of dirty candidates \(d\) with \(n_t \ge _{3/4} d\); by Lemma 1, every Kemeny ranking fulfills \(D_0> n_1> D_1> n_2> \dots> n_t > D_t\).

\(\ge _{3/4}\)-Majority Rule. Let \((V,C)\) be an election and \(N\) and \(D_0, \dots , D_t\) be the sets of non-dirty and dirty candidates as specified above. Replace the original instance by the \(t+1\) subinstances induced by \(D_i\) for \(i \in \{0, \dots , t\}\).

The soundness of the \(\ge _{3/4}\)-Majority Rule follows directly from Lemma 1. It remains to discuss the running time of one application of the data reduction rule. The application of the \(\ge _{3/4}\)-Majority Rule works as follows. In a first step, compute for each pair of candidates the majority ratio, that is, for each candidate pair \(\{a,b\}\) determine how many votes prefer \(a\) to \(b\) and vice versa. This takes \(O(|V| \cdot |C|^2)\) time and a table storing this information takes \(O(|C|^2)\) space. In a second step, decide with \(O(|C|)\) table lookups for each candidate whether it is dirty or non-dirty according to the \(\ge _{3/4}\)-majority. This gives the candidate set \(N\) as specified above. A third step is to compute the candidate sets \(D_0, \dots , D_t\). For each set, we need at most \(O(|C|)\) table lookups. Finally, computing the \(t+1\) new instances takes \(O(|V| \cdot |C|^2)\) time, because one just copies the original instance, leaving out the candidates that are not in the corresponding candidate sets. Since \(t \le |C|\), the overall running time is \(O(|V|\cdot |C|^2)\).
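The steps just described can be sketched as follows (Python; a simplified illustration with names of our choosing): it computes the pairwise table, identifies and orders the non-dirty candidates, buckets the dirty candidates into \(D_0, \dots , D_t\), and builds the induced subinstances.

```python
from itertools import combinations
from math import ceil

def majority_rule(votes, s=0.75):
    """Sketch of the >=_s-Majority Rule for s in [3/4, 1]: drop the non-dirty
    candidates and split the rest into independent subinstances."""
    candidates, need = list(votes[0]), ceil(s * len(votes))
    pref = {(a, b): sum(v.index(a) < v.index(b) for v in votes)
            for a, b in combinations(candidates, 2)}
    pref.update({(b, a): len(votes) - k for (a, b), k in list(pref.items())})
    def beats(a, b):                  # a >=_s b
        return pref[(a, b)] >= need
    nondirty = [c for c in candidates
                if all(beats(c, d) or beats(d, c) for d in candidates if d != c)]
    # order N as n_1 >=_s n_2 >=_s ...: sorting by the number of beaten
    # candidates works since earlier candidates beat strictly more
    nondirty.sort(key=lambda c: sum(beats(c, d) for d in candidates if d != c),
                  reverse=True)
    dirty = [c for c in candidates if c not in nondirty]
    buckets = [[] for _ in range(len(nondirty) + 1)]
    for d in dirty:                   # D_i: beaten by exactly n_1, ..., n_i
        buckets[sum(beats(n, d) for n in nondirty)].append(d)
    # each nonempty D_i yields the subinstance induced by its candidates
    return [[tuple(c for c in v if c in D) for v in votes]
            for D in buckets if D]
```

For example, four votes in which a unanimous candidate \(m\) separates the dirty pairs \(\{a,x\}\) and \(\{b,c\}\) are split into two independent two-candidate subinstances.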

Rules for weaker majorities.

The existence of a data reduction rule analogous to the \(\ge _{3/4}\)-Majority Rule for \(\ge _s\)-majorities with \(s < 3/4\) would be desirable since such a rule might be more effective: There are instances for which a candidate is dirty according to the \(\ge _{3/4}\)-majority but non-dirty according to a \(\ge _s\)-majority with \(s < 3/4\). Hence, for many instances, the number of dirty pairs according to the \(\ge _{3/4}\)-majority is higher than according to smaller values of \(s\). In the following, we discuss why an analogous \(\ge _s\)-Majority Rule with \(s < 3/4\) cannot work. The decisive point of the \(\ge _{3/4}\)-Majority Rule is that, in a Kemeny ranking, every non-dirty candidate must be ordered according to the \(\ge _{3/4}\)-majority with respect to every other candidate.

To this end, we start by introducing the concept of the \(\ge _s\)-majority order.

Definition 1

Let \((V,C)\) be an election, \(r\) be a ranking over \(C\), and \(s \in [0,1]\). We say \(r\) is a \(\ge _s\)-majority order if the following holds:

$$\begin{aligned} \forall \{a,b\} \subseteq C: a \ge _s b \Rightarrow \text {``}a > b\text {'' in }r. \end{aligned}$$

Informally speaking, a \(\ge _s\)-majority order fully respects all \(\ge _s\)-majority relations.

For \(\ge _s\)-majorities with \(s > 2/3\), instances without dirty candidates are polynomial-time solvable [7]. More specifically, the \(\ge _s\)-majority order is a Kemeny ranking. However, a simple example shows that, for any \(s \le 2/3\), a \(\ge _s\)-majority order does not always exist.

Example 1

Consider the election consisting of the three candidates \(a,b\), and \(c\) and the three votes \(a>b>c\), \(b>c>a\), and \(c>a>b\). Here, \(a \ge _{2/3} b\), \(b \ge _{2/3} c\), and \(c \ge _{2/3} a\). No linear order fulfills all three relations.
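The example can be checked mechanically; the following sketch (Python, with our own helper names) confirms that all three \(\ge _{2/3}\)-relations hold while no total order respects them.

```python
from itertools import permutations
from math import ceil

votes = [('a', 'b', 'c'), ('b', 'c', 'a'), ('c', 'a', 'b')]
need = ceil(2 / 3 * len(votes))                      # = 2 of the 3 votes

def geq(x, y):                                       # x >=_{2/3} y
    return sum(v.index(x) < v.index(y) for v in votes) >= need

cycle = [p for p in [('a', 'b'), ('b', 'c'), ('c', 'a')] if geq(*p)]
# all three relations hold, but no ranking can respect the cycle
orders = [r for r in permutations('abc')
          if all(r.index(x) < r.index(y) for x, y in cycle)]
```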

When the \(\ge _s\)-majority order does not exist, the ordering of the non-dirty candidates is not clear. Hence, it is not even possible to compute the sets \(D_0, \dots , D_t\) as described in the \(\ge _{3/4}\)-Majority Rule. It remains to study \(s\)-values with \(2/3 < s < 3/4\). We show that in such cases a pair consisting of a dirty and a non-dirty candidate need not be ordered according to the \(\ge _s\)-majority.

Proposition 1

Consider a \(\ge _s\)-majority for any rational \(s \in \ ] 2/3, 3/4 [ \). For a non-dirty candidate \(x\) and a dirty candidate \(y\), \(x \ge _s y\) does not imply \(x>y\) in a Kemeny ranking.

Proof

Let \(s_1\) and \(s_2\) be two positive integers such that \(s = s_1/s_2\). We construct an election such that there is a non-dirty candidate \(x\) and a dirty candidate \(y\) with \(x \ge _{s} y\) but “\(y>x\)” in every Kemeny ranking. The set of candidates is \(\{x,y\} \cup A\) with \(A:=\{a_1,a_2,\dots ,a_l\}\) being the set of dummy candidates. The number \(l\) of dummy candidates directly depends on \(s\) and will be discussed later. For now, let \(l\) be a fixed integer greater than one. We have the following votes, where \(A\) stands for \(a_1>a_2>\ldots >a_l\):

$$\begin{aligned} \begin{array}{llll} \bullet &{}\,s_1 \cdot s_2 - s_1^2 &{}\quad \text {votes of type~1:}\,\, x > y > A,\\ \bullet &{}\,2s_1^2- s_1\cdot s_2 &{}\quad \text {votes of type~2:}\,\, A > x > y,\\ \bullet &{}\,s_1 \cdot s_2 - s_1^2 &{}\quad \text {votes of type~3:}\,\, y > A > x. \end{array} \end{aligned}$$

We first show that there is a positive number of votes of every type and the total number of votes is \(s_1 \cdot s_2\): The total number of votes is

$$\begin{aligned} s_1 \cdot s_2 - s_1^2 + 2s_1^2- s_1\cdot s_2 + s_1 \cdot s_2 - s_1^2 = s_1 \cdot s_2. \end{aligned}$$

Considering the number of votes of types 1 and 3, recall that \(3/4 > s_1/s_2\) and thus \(s_2 > 4/3 \cdot s_1\). Hence, for both types the number of votes is

$$\begin{aligned} s_1 \cdot s_2 - s_1^2 > s_1 \cdot (4/3 \cdot s_1 -s_1) > 0. \end{aligned}$$

Regarding votes of type 2, we use the trivial bound that \(s_1/s_2 > 1/2\) and thus their number is

$$\begin{aligned} 2s_1^2- s_1\cdot s_2 > s_1 \cdot (2s_1 - 2s_1) = 0. \end{aligned}$$

Now, we show that \(x\) is non-dirty and \(x \ge _s y\): The number of votes with “\(a>x\)” for \(a \in A\) is

$$\begin{aligned} 2s_1^2- s_1\cdot s_2+s_1 \cdot s_2 - s_1^2 = s_1^2 = s \cdot |V| \end{aligned}$$

and the number of votes with “\(x>y\)” is

$$\begin{aligned} s_1 \cdot s_2 - s_1^2 + 2s_1^2- s_1\cdot s_2 = s_1^2 = s \cdot |V|. \end{aligned}$$

Thus, \(x\) is non-dirty according to the \(\ge _s\)-majority and \(x \ge _s y\).

In the following, we show that the score of “\(y>A>x\)” is smaller than the score of every other ranking and, hence, there is no Kemeny ranking in which \(x\) and \(y\) are ordered according to the \(\ge _s\)-majority.

Having “\(a_1>a_2>\ldots >a_l\)” in every vote, we also have “\(a_1>a_2>\ldots >a_l\)” in every Kemeny ranking (see, e.g., [6]). Distinguishing three cases, we first show that in every Kemeny ranking “\(a_i > x\)” if and only if “\(a_j> x\)”, and “\(a_i> y\)” if and only if “\(a_j > y\)” for every \(i,j \in \{1,\dots , l\}\). After this, we can treat \(A\) like one candidate of “weight” \(l\). Thus, there remain only six rankings for which the score has to be investigated in order to show that “\(y>A>x\)” is the only ranking with minimum score.

  1. Case 1:

    Consider the ranking \(\ldots >a_i>x>a_j>\ldots \), that is, candidate \(a_i\) is directly followed by candidate \(x\) and \(x\) is directly followed by candidate \(a_j\). Such a ranking cannot have minimum score: swapping \(x\) and \(a_j\) leads to a ranking with smaller score, since “\(a_j > x\)” in at least \(s \cdot |V| > 2/3 \cdot |V|\) votes.

  2. Case 2:

    Consider the ranking \(\ldots >a_i>y>a_j>\ldots \), that is, candidate \(a_i\) is directly followed by candidate \(y\) and \(y\) is directly followed by candidate \(a_j\). This ranking cannot have minimum score since swapping \(a_i\) and \(y\) leads to a ranking with smaller score. This can be seen as follows. Since \(s_1 < 3/4 \cdot s_2\), the number of votes with “\(y> a_i\)” is

    $$\begin{aligned} 2s_1s_2-2s_1^2> 2s_1 (s_2- 3/4 \cdot s_2)= 1/2 \cdot s_1s_2 = |V|/2. \end{aligned}$$
  3. Case 3:

    This case consists of ranking \(\ldots >a_i>x>y>a_j>\ldots \) and the same ranking with \(x\) and \(y\) swapped. The latter ranking would clearly have a larger score than the former so it suffices to analyze the former. We show that “\(a_i> a_j > x > y\)” has a smaller score than “\(a_i> x> y> a_j\)”. The only pairs that change the score are \(\{a_j,y\}\) and \(\{a_j,x\}\), contributing with

    $$\begin{aligned} \#_v(a_j>y) + \#_v(a_j > x) = 2s_1^2-s_1s_2 + 2s_1^2-s_1s_2+s_1s_2-s_1^2 = 3s_1^2 -s_1s_2, \end{aligned}$$

    to the old score and with \(2|V| - \#_v(a_j>y) - \#_v(a_j > x)\) to the “new” score. Hence, it remains to show that the difference between the old and the new score is positive, that is,

    $$\begin{aligned} 3s_1^2 -s_1s_2 -2s_1s_2 +3s_1^2 -s_1s_2= 6s_1^2 - 4s_1s_2 > 6 \cdot 2/3 \cdot s_1s_2 - 4s_1s_2 =0. \end{aligned}$$

Finally, consider the scores of all possible remaining six ranking types \(r_1, \dots , r_6\):

$$\begin{aligned} r_1: A>x>y \qquad \quad r_3: x>A>y \qquad \quad r_5:y>A>x,\\ r_2: A>y>x \qquad \quad r_4: x>y>A \qquad \quad r_6: y> x>A. \end{aligned}$$

Let \(t(r)\) denote the score of a ranking \(r\). It is easy to verify that \(t(r_1)< t(r_2)\), \(t(r_1) < t(r_3)\), and \(t(r_4) < t(r_6)\). Hence, to show that \(r_5\) has the smallest score, it remains to compare the score of \(r_5\) with the scores of \(r_1\) and \(r_4\). The following computations show that \(r_5\) has a smaller score than \(r_1\) and \(r_4\); since \(A\) represents \(l\) candidates, we count the corresponding pairs \(l\) times.

$$\begin{aligned} t(r_4)- t(r_5)&=~ \#_v(y>x) + l \cdot \#_v(A>x) +l \cdot \#_v(A>y)\\&\quad ~-l \cdot \#_v(A>y) -l \cdot \#_v(x>A) - \#_v(x>y), \\&=~ \#_v(y>x) + l \cdot \#_v(A>x) -l \cdot \#_v(x>A) - \#_v(x>y), \\&=~ s_1s_2-s_1^2 + l s_1^2 - l s_1s_2 + l s_1^2 -s_1^2, \\&=~ 2(l-1)s_1^2-(l-1)s_1s_2. \end{aligned}$$

In the following, we show that this term is positive. Note that \(s_1/s_2>2/3 \Leftrightarrow s_1>2/3 \cdot s_2\) and \(l>1\).

$$\begin{aligned}\begin{array}{ll} 0<2(l-1)s_1^2-(l-1)s_1s_2 &{}\Leftrightarrow \\ 0<2s_1^2-s_1s_2&{} \Leftarrow \\ 0<4/3 \cdot s_1s_2 - s_1s_2&{}\Leftrightarrow \\ 0 <1/3 \cdot s_1s_2. \end{array} \end{aligned}$$

Thus, \(t(r_5) < t(r_4)\) and, therefore, \(t(r_5)<t(r_6)\).

$$\begin{aligned} t(r_1) - t(r_5)&= ~l \cdot \#_v(x>A) + l \cdot \#_v(y>A) + \#_v(y>x) -l \cdot \#_v(A>y) -l \\&\cdot \#_v(x>A)- \#_v(x>y),\\&= ls_1s_2-ls_1^2 + 2ls_1s_2-2ls_1^2 + s_1s_2-s_1^2 - 2ls_1^2+ls_1s_2 \!- ls_1s_2 \!+ls_1^2 -s_1^2,\\&= (3l+1)s_1s_2 - (4l+2)s_1^2. \end{aligned}$$

Again, we show that this term is positive. Note that \(s_1/s_2<3/4\). Hence, there is a fixed and positive number \(\epsilon \) such that \(s_2/s_1=4/3+\epsilon \), that is, \(s_2=(4/3+\epsilon )s_1\).

$$\begin{aligned}\begin{array}{ll} 0<(3l+1)s_1s_2 - (4l+2)s_1^2 &{}\Leftrightarrow \\ 0<(4/3+\epsilon )(3l+1)s_1^2 - (4l+2)s_1^2 &{}\Leftrightarrow \\ 0<4l + 3l\epsilon + 4/3 + \epsilon - 4l - 2&{} \Leftrightarrow \\ 2<3l\epsilon + 4/3 + \epsilon .&{} \end{array} \end{aligned}$$

For \(l>1/\epsilon \), one has \(t(r_5)<t(r_1)\) and, hence, \(t(r_5)<t(r_2)\) and \(t(r_5)<t(r_3)\).

As the above terms show, it suffices to use a number \(l\) of dummy candidates that depends only on \(s\). Our counterexample works for

$$\begin{aligned} l > \frac{1}{\epsilon } = \frac{1}{\frac{1}{s} - \frac{4}{3}} = \frac{3s}{3-4s}. \end{aligned}$$

Altogether, we showed that \(r_5\) is the only Kemeny ranking. Thus, there is an election with \(x \ge _s y\) for every \(s \in \ ] 2/3, 3/4 [ \) such that every Kemeny ranking fulfills “\(y>x\)”. \(\square \)
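The construction can be instantiated and checked numerically. The following sketch (Python; the concrete parameter choices are ours) uses \(s = 5/7 \in \ ]2/3,3/4[\), hence \(\epsilon = 1/15\) and \(l = 16\) dummy candidates, and compares the six ranking types from the proof.

```python
from itertools import combinations

# s = s1/s2 = 5/7 lies in ]2/3, 3/4[; epsilon = 1/s - 4/3 = 1/15, so any
# l > 15 dummy candidates works; we take l = 16
s1, s2, l = 5, 7, 16
A = [f'a{i}' for i in range(l)]
votes  = (s1 * s2 - s1 * s1) * [tuple(['x', 'y'] + A)]      # type 1: x > y > A
votes += (2 * s1 * s1 - s1 * s2) * [tuple(A + ['x', 'y'])]  # type 2: A > x > y
votes += (s1 * s2 - s1 * s1) * [tuple(['y'] + A + ['x'])]   # type 3: y > A > x

def score(r, votes):
    """Kemeny score of ranking r: disagreeing pairs summed over all votes."""
    pos = {c: i for i, c in enumerate(r)}
    return sum(sum(1 for c, d in combinations(v, 2) if pos[c] > pos[d])
               for v in votes)

rankings = {name: tuple(r) for name, r in [
    ('r1', A + ['x', 'y']), ('r2', A + ['y', 'x']), ('r3', ['x'] + A + ['y']),
    ('r4', ['x', 'y'] + A), ('r5', ['y'] + A + ['x']), ('r6', ['y', 'x'] + A)]}
scores = {name: score(r, votes) for name, r in rankings.items()}
```

Consistent with the proof, \(r_5\) attains the strictly smallest of the six scores although \(x \ge _s y\) holds in exactly \(s \cdot |V| = 25\) of the \(35\) votes.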

2.3 A linear partial kernel preserving all Kemeny rankings

Lemma 1 shows that the \(\ge _{3/4}\)-Majority Rule preserves the possibility to find all optimal solutions. After exhaustive application (at most \(|C|\) times) of the \(\ge _{3/4}\)-Majority Rule the remaining instances contain only dirty candidates. Making use of a simple relation between the number of dirty candidates and the average KT-distance as also used previously [7], we achieve the following.

Theorem 1

Let \(d_a\) be the average KT-distance and \(n_d^s\) be the number of dirty pairs according to the \(\ge _s\)-majority with \(s \in [3/4,1]\). Kemeny Rank Aggregation admits a partial problem kernel with less than \(\min \{ 16 d_a / 3, 2 n_d^s \}\) candidates. Moreover, every Kemeny ranking of the original instance contains a Kemeny ranking of the reduced instance and every Kemeny ranking of the reduced instance can be extended in polynomial time to a Kemeny ranking of the original instance.

Proof

The partial kernel derives from applying the \(\ge _{3/4}\)-Majority Rule, which is (easily) adapted to work for the decision problem: An instance is reduced by deleting all candidates from \(N\), reordering every vote such that \(D_0>D_1> \dots >D_t\) where, inside each \(D_i\), \(0 \le i\le t\), the order of the candidates remains unchanged, and decreasing the Kemeny score appropriately. After having applied the adapted reduction rule, an instance containing only dirty candidates remains. Let their total number be \(i\). Since every dirty candidate must be involved in at least one candidate pair that is dirty according to the \(\ge _{3/4}\)-majority, there must be at least \(i/2\) candidate pairs that contribute to the average KT-distance. To analyze the contribution of such a pair, we take a look at the definition of the average KT-distance.

$$\begin{aligned} d_a&= \sum _{v,w \in V}{{\mathrm{KT-distance}}}(v,w) / ({|V| (|V|-1)}),\\&= \sum _{\{v,w\} \subseteq V}{{\mathrm{KT-distance}}}(v,w) \cdot 2 / ({|V| (|V|-1)}),\\&= \sum _{\{v,w\} \subseteq V} \sum _{\{c,d\} \subseteq C} d_{v,w}(c,d) \cdot 2 / ({|V| (|V|-1)}),\\&= \sum _{\{c,d\} \subseteq C} \sum _{\{v,w\} \subseteq V} d_{v,w}(c,d) \cdot 2 / ({|V| (|V|-1)}). \end{aligned}$$

That is, each candidate pair contributes to the average KT-distance with

$$\begin{aligned} \sum _{\{v,w\} \subseteq V} d_{v,w}(c,d), \end{aligned}$$
(1)

normalized by the factor \(2 / ({|V| (|V|-1)})\). The sum in Eq. (1) equals the number of votes that prefer \(c\) to \(d\) times the number of votes that prefer \(d\) to \(c\). In general, this value ranges from \(0\) (all votes agree) to \((|V|/2) \cdot (|V|/2)\) (as many votes prefer \(c\) to \(d\) as prefer \(d\) to \(c\)). However, for our considered candidate pairs the contribution is at least \((|V|/4) \cdot (3|V|/4)\), since they are not ordered according to the \(\ge _{3/4}\)-majority. By definition of the average KT-distance, it follows that

$$\begin{aligned} d_a \ge \frac{i}{2} \cdot \frac{2}{|V|(|V|-1)} \cdot \frac{|V|}{4} \cdot \frac{3|V|}{4} > \frac{3}{16} \cdot i \Rightarrow 16/3 \cdot d_a > i. \end{aligned}$$

As a consequence, the reduced instance has less than \(16 d_a /3\) candidates. The upper bound \(2n_d^s\) for the number of remaining (dirty) candidates is trivial.

The relation between Kemeny rankings of the original and of the reduced instance follows with Lemma 1 and the polynomial-time computability of the data reduction. \(\square \)
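The quantities used in this proof are easy to compute directly. The following minimal Python sketch (our illustration with our own function names, not the authors’ C++ implementation) computes the KT-distance, the average KT-distance \(d_a\), and the set of dirty candidates with respect to an \(s\)-majority:

```python
from itertools import combinations

def kt_distance(v, w):
    """Kendall's tau distance: the number of candidate pairs ordered differently."""
    pos_v = {c: i for i, c in enumerate(v)}
    pos_w = {c: i for i, c in enumerate(w)}
    return sum((pos_v[c] < pos_v[d]) != (pos_w[c] < pos_w[d])
               for c, d in combinations(v, 2))

def average_kt_distance(votes):
    """d_a: sum of KT-distances over ordered vote pairs, divided by |V|(|V|-1)."""
    n = len(votes)
    total = sum(kt_distance(v, w) for v, w in combinations(votes, 2))
    return 2 * total / (n * (n - 1))

def dirty_candidates(votes, s=0.75):
    """Candidates involved in a pair that no s-fraction of the votes orders alike."""
    n = len(votes)
    positions = [{c: i for i, c in enumerate(v)} for v in votes]
    dirty = set()
    for c, d in combinations(votes[0], 2):
        prefer_c = sum(p[c] < p[d] for p in positions)
        if prefer_c < s * n and n - prefer_c < s * n:
            dirty.update((c, d))
    return dirty
```

For the election of Example 2 below (votes \(a>x>b\), \(a>x>b\), \(b>a>x\), \(x>b>a\)), `dirty_candidates` reports \(a\) and \(b\) as dirty and \(x\) as non-dirty.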

2.4 Exploiting the Condorcet property

In this subsection, we analyze a well-known data reduction rule of practical relevance [12]. We show that it reduces an instance at least as much as the \(\ge _{3/4}\)-Majority Rule, but that it does not preserve all Kemeny rankings. The reduction rule is based on the following lemma, which is known as the Extended Condorcet Criterion [38]. We give a short proof for the sake of completeness.

Lemma 2

Let \(C' \subseteq C\) be a candidate subset such that, for every \(c' \in C'\) and every \(c \in C {\setminus } C'\), at least half of the votes prefer \(c'\) to \(c\). Then there is a Kemeny ranking fulfilling \(C'> C{\setminus } C'\).

Proof

If there is a Kemeny ranking \(l\) not fulfilling \(C' > C{\setminus } C'\), then replace it by the ranking \(l'\) with \(C' > C{\setminus } C'\) in which the relative order within \(C'\) and within \(C{\setminus } C'\) remains unchanged. Comparing the scores of both rankings, they can only differ for pairs of candidates \(c',c''\) with \(c' \in C'\) and \(c'' \in C {\setminus } C'\). However, every such pair contributes at most as many points to the score of \(l'\) as to the score of \(l\). Thus, \(l'\) must also be a Kemeny ranking. \(\square \)
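The exchange argument can be checked mechanically: move the candidates of \(C'\) to the front while keeping both internal orders, and the score cannot increase. A toy Python sketch (our illustration; the weak-majority premise of the lemma must hold for the chosen \(C'\)):

```python
from itertools import combinations

def kendall_tau(v, w):
    """Number of candidate pairs that v and w order differently."""
    pv = {c: i for i, c in enumerate(v)}
    pw = {c: i for i, c in enumerate(w)}
    return sum((pv[c] < pv[d]) != (pw[c] < pw[d]) for c, d in combinations(v, 2))

def score(r, votes):
    """Kemeny score of ranking r with respect to the votes."""
    return sum(kendall_tau(r, v) for v in votes)

def pull_front(ranking, front):
    """The exchange of Lemma 2: all candidates of `front` precede the rest,
    keeping the relative order inside both groups."""
    return tuple(c for c in ranking if c in front) + \
           tuple(c for c in ranking if c not in front)
```

For instance, with the three votes \(a>b>c\), \(a>c>b\), \(c>a>b\), candidate \(a\) is weakly preferred to every other candidate, and pulling \(C'=\{a\}\) to the front of any ranking never increases the score.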

A straightforward implementation of Lemma 2 already leads to a data reduction rule running in \(O(|V| \cdot |C|^3)\) time. However, this rule may be applicable several times. The following data reduction rule runs in \(O(|V| \cdot |C|^2)\) time and computes (in one application) the same output as the exhaustive application of the straightforward rule. To formulate the rule, we need the concept of a majority graph.

Definition 2

Let \((V,C)\) be an election. The strict (weak) majority graph of \((V,C)\) is a directed graph with \(C\) being the set of vertices that contains an arc from \(x\) to \(y\) if and only if “\(x>y\)” in more than half (at least half) of the votes.

Extended Condorcet Rule

Let \((V,C)\) be an election and let \(C_1, \dots , C_t \subseteq C\) be the vertex sets forming the topologically ordered strongly connected components in the strict majority graph of \((V,C)\). Replace the original instance by the subinstances induced by \(C_i\) with \(|C_i| \ge 2, i \in \{1, \dots , t\}\).

Constructing the strict majority graph takes \(O(|V| \cdot |C|^2)\) time. Computing the strongly connected components (in topological ordering) takes \(O(|C|^2)\) time. Computing the subinstances takes \(O(|V| \cdot |C|^2)\) time. Hence, the total running time of the Extended Condorcet Rule is \(O(|V| \cdot |C|^2)\). Alternatively, one can find the set \(C_1,\dots ,C_t\) as defined in the Extended Condorcet Rule by recursively applying Lemma 2. Formally, the Extended Condorcet Rule is sound due to the following lemma (in combination with Lemma 2).
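The rule composes two textbook steps; the following Python sketch (our illustration, not the authors’ C++ implementation) builds the strict majority graph and returns its strongly connected components via Kosaraju’s algorithm, whose second pass conveniently emits the components in topological order of the condensation:

```python
from itertools import permutations

def strict_majority_graph(votes):
    """Arc x -> y iff more than half of the votes rank x before y."""
    pos = [{c: i for i, c in enumerate(v)} for v in votes]
    arcs = {c: set() for c in votes[0]}
    for x, y in permutations(votes[0], 2):
        if 2 * sum(p[x] < p[y] for p in pos) > len(votes):
            arcs[x].add(y)
    return arcs

def sccs_topological(arcs):
    """Strongly connected components of `arcs` in topological order (Kosaraju)."""
    order, seen = [], set()
    def dfs(u):
        seen.add(u)
        for v in arcs[u]:
            if v not in seen:
                dfs(v)
        order.append(u)  # post-order: u finishes after all its descendants
    for u in arcs:
        if u not in seen:
            dfs(u)
    reverse = {u: set() for u in arcs}
    for u in arcs:
        for v in arcs[u]:
            reverse[v].add(u)
    components, assigned = [], set()
    for u in reversed(order):  # process vertices by decreasing finish time
        if u in assigned:
            continue
        component, stack = set(), [u]
        while stack:
            w = stack.pop()
            if w not in assigned:
                assigned.add(w)
                component.add(w)
                stack.extend(reverse[w] - assigned)
        components.append(component)
    return components
```

For the election of Example 2 below this yields the singleton components \(\{a\}, \{x\}, \{b\}\) in this order, whereas an election forming a Condorcet cycle yields one big component and is not reduced.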

Lemma 3

Let \((V,C)\) be an election and let \(C_1, \dots , C_t \subseteq C\) be the candidate sets as computed by the Extended Condorcet Rule. Then, at least half of the votes prefer \(c_i\) to \(c_j\) for every \(c_i \in C_i\) and every \(c_j \in C_j\) with \(1 \le i<j \le t\). Given a Kemeny ranking for each subinstance induced by \(C_i\), \(i \in \{1,\dots ,t\}\), one can compute a Kemeny ranking of \((V,C)\) in \(O(|V| \cdot |C|^2)\) time.

Proof

The sets \(C_1, \dots , C_t \subseteq C\) represent the strongly connected components of the strict majority graph in topological ordering, that is, by definition of the topological ordering there is no arc from some vertex in \(C_j\) to some vertex in \(C_i\) for \(1 \le i<j \le t\). By the definition of the strict majority graph, it follows that at least half of the votes prefer \(c_i\) to \(c_j\) for every \(c_i \in C_i\) and every \(c_j \in C_j\) with \(1 \le i<j \le t\). Hence, by iteratively applying Lemma 2, one obtains a Kemeny ranking for the original instance by concatenating the Kemeny rankings of the subinstances induced by the vertex sets \(C_1, \dots , C_t\). (This includes concatenating candidates represented by components consisting of a single vertex.) \(\square \)

Remarkably, already in 1999 Cohen et al. [12] basically used what we now call the Extended Condorcet Rule for improving the solution quality (but not running time) of a greedy algorithm providing approximate results.

Although not explicitly stated, the Extended Condorcet Rule or a similar rule has been used by Schalekamp and van Zuylen [35] as preprocessing. To the best of our knowledge, in the context of exact algorithms, neither its theoretically provable performance guarantees nor its practical effectiveness have been analyzed.

Note that Brandt et al. [10] used a similar preprocessing procedure for computing so-called tournament solutions. Similarly to the components in the Extended Condorcet Rule , they compute “dominating sets”, that is, subsets of alternatives such that all alternatives from this set have the same relationship to all alternatives not in the set.

The following proposition shows that the Extended Condorcet Rule is at least as powerful as the \(\ge _{3/4}\)-Majority Rule, implying that the Extended Condorcet Rule provides a partial kernel with less than \(16 d_a / 3\) candidates.

Proposition 2

An instance to which the Extended Condorcet Rule has been exhaustively applied cannot be further reduced by the \(\ge _{3/4}\)-Majority Rule.

Proof

The proof is by contradiction. Assume that there is an instance reduced by the Extended Condorcet Rule to which the \(\ge _{3/4}\)-Majority Rule successfully applies. Then, there must be a non-dirty candidate \(x\) and a subset \(C' \subseteq C\) such that at least three quarters of the votes prefer \(c'\) to \(x\) for every \(c' \in C'\) and at least three quarters of the votes prefer \(x\) to \(c''\) for every \(c'' \in C''\), where \(C'':=C {\setminus } (C' \cup \{x\})\). Clearly, in the strict majority graph there is no arc from \(x\) to a vertex in \(C'\) and no arc from a vertex in \(C''\) to \(x\).

We can assume that \(C'\) and \(C''\) are nonempty since otherwise the Extended Condorcet Rule would obviously reduce this instance, as \(x\) forms a strongly connected component in the strict majority graph. Since at least \(3|V|/4\) votes prefer \(c'\) to \(x\) and at least \(3|V|/4\) votes prefer \(x\) to \(c''\), the intersection of these two vote sets must contain at least \(|V|/2\) votes, that is, there must be at least \(|V|/2\) votes with “\(c'>\dots >x>\dots >c''\)” for every \(c' \in C'\) and \(c'' \in C''\). Hence, in the strict majority graph of the instance there is no arc from a vertex in \(C''\) to a vertex in \(C'\). Thus, \(x\) forms a strongly connected component in the strict majority graph, a contradiction to the fact that the Extended Condorcet Rule was exhaustively applied.\(\square \)

Note that the proof of Proposition 2 works analogously if one extends the \(\ge _{3/4}\)-Majority Rule to search for “subsets of non-dirty candidates”, that is, to search for a subset \(C' \subseteq C\) such that at least three quarters of the votes agree on the relative ordering of every two candidates \(c' \in C'\) and \(c'' \in (C {\setminus } C')\). That is, the Extended Condorcet Rule also subsumes this extended data reduction rule, which clearly reduces some instances that are not reducible by the \(\ge _{3/4}\)-Majority Rule.

Finally, we remark that the partial kernel obtained by the Extended Condorcet Rule as defined above does not preserve the possibility to compute all Kemeny rankings from the kernelized instance. Consider the simple election consisting of the vote \(a>b\) and the vote \(b>a\). When computing the partial kernel, the Extended Condorcet Rule has to fix some topological ordering. Moreover, candidates represented by strongly connected components of size one have to be removed in order to obtain the kernel size upper bound (see Theorem 1). Hence, the partial kernel obtained by the Extended Condorcet Rule cannot distinguish between the simple election above and the election consisting of the two votes \(a>b\) and \(a>b\), although the former has two Kemeny rankings and the latter has only one. Of course, by providing the kernelized instance with additional information one may find all Kemeny rankings, but the partial problem kernel itself (which must be an instance of the original problem) provided by the Extended Condorcet Rule does not preserve all Kemeny rankings. Note that this observation is not only of theoretical interest, since a (partial) problem kernel has the nice property that one can apply every algorithm that solves the original problem to the (partial) problem kernel. If one needs additional information to find all Kemeny rankings, then such an algorithm must be able to handle this additional information.

Replacing the strict majority graph by the weak majority graph in the Extended Condorcet Rule, one obtains a slightly modified reduction rule. Following an argument analogous to Lemmas 2 and 3, it is easy to verify that this modified Extended Condorcet Rule preserves all solutions. The following example shows that the \(\ge _{3/4}\)-Majority Rule reduces instances that are not reduced by this modified rule.

Example 2

Consider the election consisting of the three candidates \(a\), \(b\), and \(x\) and the following votes:

      \(a>x>b\),      \(a>x>b\),      \(b>a>x\),      \(x>b>a\).

The \(\ge _{3/4}\)-Majority Rule identifies and removes the non-dirty candidate \(x\) (preserving the only Kemeny ranking \(a>x>b\)). The modified Extended Condorcet Rule constructs the weak majority graph with one single strongly connected component consisting of \(a\), \(b\), and \(x\) and, hence, does not reduce anything.

Example 2 shows that Proposition 2 does not extend to this case, and it is unclear whether the upper bound for the size of the partial kernel preserving all Kemeny rankings transfers to this modified rule. A deeper analysis of kernels preserving all Kemeny rankings, so-called enumerative kernels, is a topic of independent interest [32, 37].

2.5 Tightness of the kernel bound and practical consequences

Tightness of the kernel bound.

As the main theoretical contribution of this section, we showed that Kemeny Rank Aggregation admits a partial problem kernel where the number of candidates is linear in the average KT-distance, even if we require that the partial kernel preserve all Kemeny rankings for a given election. The upper bound of \(16 d_a / 3\) for the number of candidates in the reduced instance, where \(d_a\) denotes the average KT-distance, already holds when applying the “weaker” \(\ge _{3/4}\)-Majority Rule. As the following example shows, this bound is tight for the \(\ge _{3/4}\)-Majority Rule.

Example 3

Consider an election with the candidates \(a\) and \(b\) such that \(\lceil 3|V|/4 \rceil - 1\) votes have “\(a>b\)” and the remaining votes have “\(b>a\)”. The \(\ge _{3/4}\)-Majority Rule does not reduce this instance. The average KT-distance of this election is asymptotically \(3/8\).

The number of candidates in this election is two, which meets the upper bound of \(16 d_a / 3\) for the partial kernel.

Proposition 2 shows that the Extended Condorcet Rule may yield a better upper bound for the size of the partial problem kernel since it may reduce instances that cannot be reduced by the \(\ge _{3/4}\)-Majority Rule. The following example shows that this bound cannot be smaller than \(2 d_a\).

Example 4

Consider an election with the candidates \(a\), \(b\), \(c\), and \(d\) and the following votes:

\(x\) votes: \(a>b>c>d\),

\(x\) votes: \(c>d>a>b\),

1 vote: \(b>c>d>a\).

We consider the contribution of each candidate pair to the average KT-distance. The pair \(\{a,b\}\) contributes \(2x\), the pair \(\{c,d\}\) zero, and every other pair \(x(x+1)\). Hence, for an arbitrarily large number \(x\) the average KT-distance of this election is asymptotically

$$\begin{aligned} \lim _{x\rightarrow \infty }(2x + 4x(x+1)) / ((2x+1)\cdot 2x/2) =\lim _{x\rightarrow \infty }\frac{4}{2x + 1}+2=2. \end{aligned}$$
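These contributions can also be checked numerically. The following Python sketch (ours, purely illustrative) evaluates the average KT-distance of this election for a concrete \(x\) and matches the closed form \(2 + 4/(2x+1)\) from the limit above:

```python
from itertools import combinations

def avg_kt(votes):
    """Average KT-distance d_a of an election given as a list of rankings."""
    def kt(v, w):
        pv = {c: i for i, c in enumerate(v)}
        pw = {c: i for i, c in enumerate(w)}
        return sum((pv[c] < pv[d]) != (pw[c] < pw[d]) for c, d in combinations(v, 2))
    n = len(votes)
    return 2 * sum(kt(v, w) for v, w in combinations(votes, 2)) / (n * (n - 1))

x = 200  # an arbitrary concrete choice for the parameter x of Example 4
votes = [('a','b','c','d')] * x + [('c','d','a','b')] * x + [('b','c','d','a')]
```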

Simjour’s recent PhD thesis [37] shows that slightly modifying the Extended Condorcet Rule yields an upper bound arbitrarily close to \(2 d_a\).

Practical consequences.

From a practical point of view, our kernelization results show that applying a relatively simple data reduction rule yields reduced instances in which the number of candidates is small whenever the average Kendall’s tau distance is small. This gives a theoretical explanation of why this kind of preprocessing is very effective for elections where the votes do not “differ too much on average”.

If one is only interested in finding one Kemeny ranking instead of all Kemeny rankings, the Extended Condorcet Rule subsumes a number of data reduction rules, including

  • a rule that detects and removes Condorcet winners (resp. Condorcet losers),

  • the \(\ge _{3/4}\)-Majority Rule, and

  • an extension of the \(\ge _{3/4}\)-Majority Rule to non-dirty candidate sets as discussed above.

It seems reasonable to assume that any preprocessing for Kemeny Rank Aggregation should at least handle Condorcet winners and Condorcet losers. Since the worst-case time complexity of \(O(|V| \cdot |C|^2)\) for finding Condorcet winners is the same as for applying the Extended Condorcet Rule, we suggest solely implementing the Extended Condorcet Rule for the scenario of finding one Kemeny ranking.

Since our practical focus is computing one optimal Kemeny ranking, in the next section, we analyze the effectiveness and the efficiency of the Extended Condorcet Rule in practice.

3 Experimental work

In this section, we first provide a detailed overview of the data sets, implementations, and test series used in our experiments. Then, we summarize interesting specific results for each data set. Finally, we discuss the suitability of our approach for exactly computing a Kemeny ranking with respect to the specific data sets. Further tables with our experimental results can be found in the Supplementary Material.

All experiments were carried out on a standard PC with a 3.4 GHz CPU (Intel(R) Core(TM) i3-2130) and 4 GB RAM running Debian 6.0 (64 bit) Linux. Source code and test data are freely available under the GPL Version 3 license at http://www.akt.tu-berlin.de/menue/software/.

3.1 Data sets

Our test data consist of four sets of real-world data and two sets of synthetic data. We used standard applications in sports competitions and web search to obtain real-world rankings and two statistical models for random permutations to generate synthetic rankings.

3.1.1 Winter sports

Given rankings in sports over several seasons, one may ask for an overall ranking. This can easily be interpreted as an election with individuals or teams as candidates and the seasonal rankings as votes. We generated one such election for ski jumping and one for ski cross. To this end, we considered the world cup rankings from the seasons 2005/2006 to 2008/2009, ignoring candidates not appearing in all four rankings. We ended up with an election consisting of four complete votes ranking 33 candidates for ski jumping and with an election consisting of four complete votes ranking 69 candidates for ski cross (see Table 1 in Supplementary Material).

3.1.2 Formula 1

Similarly to winter sports, determining the winner of a Formula 1 season can be interpreted as an election where the candidates are the drivers and the votes are the single races. We considered the seasons from 1970 till 2008. Since our current algorithms and their implementation can handle neither ties nor partial rankings, we only considered candidates that competed in all races. Drivers that dropped out of a race are ranked according to how long they participated in that race. The generated instances have around 16 votes and up to 28 candidates (see Table 5 in Supplementary Material).

3.1.3 Web search

A prominent application of Kemeny Rank Aggregation is to aggregate search result rankings obtained from different web search engines. To generate rankings, we queried the same 37 search terms as Dwork et al. [16], Schalekamp and van Zuylen [35], and Ali and Meilă [2]. We used the search engines Google, Lycos, MSN Live Search, and Yahoo to generate rankings of 1,000 websites. We considered two search results as identical if their URLs are identical up to some canonical form (cutting after the top-level domain). Results not appearing in all rankings were ignored. Ignoring the term “zen budism” with only 18 candidates, this results in 36 instances having between 55 and 163 candidates (see Table 9 in Supplementary Material).

3.1.4 Web impact

We generated rankings that measure the “impact in the web” of different search terms. For a search engine, a list of search terms is ranked according to the number of hits of each single term. We used Ask, Google, MSN Live Search, and Yahoo to generate rankings for all capitals (240 candidates), all nations (242 candidates; we obtained more nations than capitals because some capital names appeared twice), and the 103 richest people of the world (see Table 13 in Supplementary Material).

3.1.5 Mallows model

Our first source of synthetic instances are samples from the Mallows model [27]. This is an exponential model over rankings (permutations, in the original setting). The Mallows model consists of a central ranking \(r_0\) of \(m\) elements and a dispersion parameter \(\theta \in \mathbb {R}_+\) that quantifies the concentration of the rankings around the peak \(r_0\) with respect to some distance measure (Kendall’s tau distance in our case). The probability of a ranking \(r\) is

$$\begin{aligned} P_{r_0,\theta }(r) = \frac{e^{-\theta \cdot d(r,r_0)}}{Z} \quad \text {and}\quad Z = \sum _{r \in S_m} e^{-\theta \cdot d(r,r_0)}, \end{aligned}$$

where \(d\) is Kendall’s tau distance and \(S_m\) denotes the symmetric group of order \(m\), that is, the set of all rankings of \(m\) elements. Note that the normalization constant \(Z\) is independent of \(r_0\) and that Kemeny Rank Aggregation is a maximum likelihood estimator for \(r_0\) [28]. We remark that the expected KT-distance between a ranking generated by the Mallows model and its central ranking is linearly upper bounded in the number of candidates [19]. Since the KT-distance is a metric, this upper bound transfers to the average KT-distance.
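One standard way to sample from the Mallows model under Kendall’s tau distance is the repeated insertion model: the \(i\)th candidate of the central ranking is inserted at position \(j \in \{1,\dots ,i\}\) with probability proportional to \(e^{-\theta (i-j)}\). A minimal Python sketch of this sampler (our illustration; not necessarily the generator used in the experiments):

```python
import math
import random

def sample_mallows(center, theta, rng=random):
    """Draw one ranking from the Mallows model around `center` with dispersion
    `theta`, via the repeated insertion model: the i-th candidate of `center`
    is inserted at position j in {1, ..., i} with probability proportional to
    exp(-theta * (i - j)); inserting at the bottom adds no new inversions."""
    ranking = []
    for i, c in enumerate(center, start=1):
        weights = [math.exp(-theta * (i - j)) for j in range(1, i + 1)]
        j = rng.choices(range(i), weights=weights)[0]  # 0-indexed insert position
        ranking.insert(j, c)
    return ranking
```

For \(\theta \) close to zero the samples are nearly uniform permutations; for large \(\theta \) they concentrate sharply on the central ranking.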

We used the following parameters to create our test instances:

  • number of votes \(n\): \(4,5,10,20\);

  • number of candidates \(m\): \(10,20,\dots ,100,125,150,175,200\);

  • dispersion parameter \(\theta \): \(0.005,0.01,0.05,0.1,0.2,\dots ,0.9,0.95,0.99,0.995\).

For each parameter combination, we created ten instances.

3.1.6 Plackett–Luce model

Our second source of synthetic instances are samples from the Plackett–Luce model [25, 33]. The model parameter is a vector of weights \(S=(s_1,\dots ,s_m) \in \mathbb {R}_+^m\). The probability of a ranking \(r\) of \(m\) elements is

$$\begin{aligned} P_S(r) = \prod _{i=1}^m \frac{s_{r^{-1}(i)}}{Z_i}\quad \text {and}\quad Z_i = \sum _{j=i}^{m} s_{r^{-1}(j)}, \end{aligned}$$

where \(r^{-1}(j)\) denotes the \(j\)th element of ranking \(r\). In a sense, \(s_i\) denotes the “importance” of element \(i\). The Plackett–Luce model does not have as nice an interpretation as the Mallows model, but it is easy to see that the mode of this distribution ranks the candidates decreasingly with respect to their importance. The votes should be less divergent (and thus easier to solve) the faster the weights decrease.
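Sampling from the Plackett–Luce model is straightforward: the candidates are drawn one position at a time, each time selecting a remaining candidate with probability proportional to its weight, which reproduces the product formula above. A minimal Python sketch (our illustration; not necessarily the generator used in the experiments):

```python
import random

def sample_plackett_luce(weights, rng=random):
    """Draw one ranking from the Plackett-Luce model: repeatedly pick the next
    candidate with probability proportional to its weight among the remaining
    candidates. `weights` maps each candidate to a positive importance s_i."""
    remaining = dict(weights)
    ranking = []
    while remaining:
        candidates = list(remaining)
        chosen = rng.choices(candidates,
                             weights=[remaining[c] for c in candidates])[0]
        ranking.append(chosen)
        del remaining[chosen]
    return ranking
```

With steeply decreasing weights (such as the exp2 setting below), the samples concentrate on the ranking ordered by decreasing importance.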

We used the following parameters to create our test instances:

  • number of votes \(n\): \(4,5,10,20\);

  • number of candidates \(m\): \(10,20,\dots ,100,125,150,175,200\);

  • importance weights \(S\): poly1 = \(\{m,m-1,\dots ,1\}\), poly2 = \(\{m^2,(m-1)^2,\dots ,1\}\), poly3 =  \(\{m^3, (m-1)^3,\dots ,1\}\), exp2 = \(\{2^m,2^{m-1},\dots ,2\}\).

For each parameter combination, we created ten instances.

3.2 Implementation and test series

Our algorithms are implemented in C++ using several libraries of the Boost package. Our implementation consists of about 4,000 lines of code.

Data reduction.

For the sake of completeness, we implemented all data reduction rules discussed in this paper, including the \(\ge _{3/4}\)-Majority Rule extended to non-dirty candidate sets, the modified Extended Condorcet Rule (based on the weak majority graph), and a rule that removes Condorcet winners and Condorcet losers. Preliminary tests showed that it is most efficient to only run the Extended Condorcet Rule, confirming our theoretical findings. Thus, we restrict our evaluation to analyzing the efficiency and the effectiveness of the Extended Condorcet Rule compared to running no data reduction.

Implementation of the exact algorithms.

To solve the (hopefully small) parts of the instances remaining after the application of the data reduction, we implemented three exact algorithms. First, an extended version of the search tree algorithm showing fixed-parameter tractability with respect to the Kemeny score [6, 11]. Second, a dynamic programming algorithm running in \(O(2^m m + nm^2)\) time for \(m\) candidates and \(n\) votes [6, 34]. Third, the integer linear program (ILP) [13, Linear Program 3], which was the fastest exact algorithm in previous experimental studies [13, 35]. We used the freely available ILP-solver GLPK in version 4.43 as well as the two commercial ILP-solvers CPLEX in version 12 and Gurobi in version 4.6.1 to solve the ILP.
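For reference, the kind of ILP used here can be sketched as follows (a common formulation in the spirit of [13, Linear Program 3]; the exact program in [13] may differ in details). Let \(w_{ab}\) denote the number of votes preferring \(a\) to \(b\), and let the binary variable \(x_{ab}\) encode that \(a\) is ranked before \(b\):

$$\begin{aligned} \text {minimize} \quad&\sum _{a \ne b} w_{ba} \, x_{ab}\\ \text {subject to} \quad&x_{ab} + x_{ba} = 1 \quad \text {for all } a \ne b,\\&x_{ab} + x_{bc} + x_{ca} \ge 1 \quad \text {for all distinct } a, b, c,\\&x_{ab} \in \{0,1\}. \end{aligned}$$

The equality constraints enforce a complete order relation, the triangle constraints forbid cycles, so feasible solutions correspond exactly to rankings, and the objective counts the disagreements of the votes with the chosen ranking.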

Test series.

Generally speaking, the algorithms supported by high-end ILP-solvers turned out to be the fastest. Although the dynamic programming approach is provably fast for small instances of up to 23 candidates and the search tree algorithms beat the dynamic programming algorithm on several instances, we recommend using the commercial ILP-solvers: dynamic programming and search tree algorithms are only faster than the ILP-solvers on some small instances, which could be solved by all considered approaches in less than 1 s. The commercial solvers are significantly faster than GLPK for all considered instances. Since the corresponding results leave no doubt, we omit a detailed evaluation and restrict our analysis in the following to the ILP-solver approach with the two commercial ILP-solvers CPLEX and Gurobi. However, we will see that the Extended Condorcet Rule (see Sect. 2) significantly speeds up all approaches, thus confirming our theoretical findings.

In the next subsections, we discuss the results of our experiments. To this end, we computationally analyzed several properties of the input instances. As already discussed in Sect. 2, the data reduction is able to split input instances into smaller subinstances which can be solved independently. Hence, the effectiveness of the data reduction is reflected in the structure of these subinstances. To quantify this, after applying the data reduction we determined the profile of the reduced instances. Basically, a profile is a string (consisting of numbers separated by “\(>\)”) that expresses the sizes of the subinstances as well as the way of combining the solutions of the subinstances to get the overall solution. The profiles are to be read as follows. Every “1” stands for a trivial subinstance with one candidate, that is, the position of the candidate in the overall Kemeny ranking was determined by the data reduction. Higher numbers stand for groups of candidates whose “internal” order could not be determined by the data reduction, that is, non-trivial subinstances that can be solved independently. Sequences of \(i\) consecutive 1s are abbreviated by \(1^i\). The ordering of the numbers corresponds to the ordering in which one concatenates the Kemeny rankings of the subinstances to get the overall Kemeny ranking.

For example, the profile “\(1^{2}\)>\(3\)>\(1\)>\(2\)” expresses that we know the best and the second best candidate, we know the set of candidates that must assume positions 3-5 without knowledge of their relative orders, we know the candidate in position 6, and we know the two candidates that must assume positions 7 and 8 without knowledge of their relative order. Thus, in this toy example two non-trivial subinstances remain, one with three and one with two candidates.
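The profile strings can be generated mechanically from the ordered list of subinstance sizes. A small Python helper (our own, purely illustrative) reproduces the notation, abbreviating runs of 1s:

```python
def profile(sizes):
    """Render an ordered list of subinstance sizes as a profile string,
    e.g. [1, 1, 3, 1, 2] -> '1^2>3>1>2' (runs of 1s are abbreviated)."""
    parts, i = [], 0
    while i < len(sizes):
        if sizes[i] == 1:
            j = i
            while j < len(sizes) and sizes[j] == 1:
                j += 1               # extend the run of trivial subinstances
            run = j - i
            parts.append('1' if run == 1 else f'1^{run}')
            i = j
        else:
            parts.append(str(sizes[i]))
            i += 1
    return '>'.join(parts)
```

For the toy example above, `profile([1, 1, 3, 1, 2])` yields `1^2>3>1>2`.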

We performed the following test series on all instances in our data sets:

  1. Properties.

We computed basic properties of our instances: the number of candidates and votes, the percentage of conflict pairs, the percentage of candidate pairs ordered according to the \(\ge _{3/4}\)-majority, simple lower and upper bounds on the Kemeny score, the maximum position range of a candidate, the average KT-distance, and a maximum likelihood estimate for the dispersion parameter \(\theta \) of the Mallows model.

  2. Reduction.

We computed the profiles (as discussed above) of the reduced instances and other reduction quality indicators such as the number of subinstances after preprocessing, the maximum number of candidates among the subinstances, the average number of candidates among the subinstances, and the percentage of conflict pairs whose ordering in the consensus could be determined by the data reduction.

  3. Speedup.

We computed a Kemeny consensus for each instance, once without and once with data reduction as preprocessing. We compared both running times, computed the speedup gained by applying the data reduction, and recorded the Kemeny scores of the instances.

3.3 Experimental results

In this section, we summarize results from our test series applied to the six ranking data sets.

Winter sports.

Without data reduction, the ski cross instance, consisting of 69 candidates, was solved by the ILP-solver Gurobi in about 3.9 s. In contrast, the instance was solved in 0.11 s when using the Extended Condorcet Rule from Sect. 2 as preprocessing and then applying Gurobi for the unresolved subinstances. This speedup is possible because the data reduction solves the main part of the instance in milliseconds and leaves two relatively small subinstances containing 12 and 15 candidates. Clearly, these subinstances can be quickly solved by all ILP-solvers. Furthermore, the ski jumping instance, consisting of 33 candidates, could be fully solved in milliseconds by only applying the data reduction. The speedup through applying data reduction is at least 8 with respect to both instances and both ILP-solvers. Details concerning the experimental results of this data set can be found in Tables 1–4 in Supplementary Material. Speedup and running times are illustrated in Fig. 1.

Fig. 1
figure 1

Speedup (left) and running times (right) of the two winter sports instances (arbitrarily ordered)

Formula 1.

The data reduction was successful for most Formula 1 instances. Except for one instance, namely the one corresponding to the season 1983, the data reduction ended up with subinstances of at most 13 candidates each. Notably, even for 1983 the exact positions of more than 20 % of the candidates could already be determined by the data reduction. More details about the structure of the reduced instances can be found in the profiles table (Table 7 in Supplementary Material). As Fig. 2 and Table 8 in Supplementary Material show, applying the data reduction gives a significant speedup of the overall running time in most cases. However, all Formula 1 instances are still small enough to be solved without data reduction in less than 1 s.

Fig. 2
figure 2

Speedup (left) and running times (right) of the Formula 1 instances (seasonal ordering)

Currently, the winner determination in Formula 1 is based on a “scoring rule”: in a single race, every candidate gets points depending on the outcome, and the candidate with the highest total score wins. To compare with the official winners, we computed Kemeny winners for the seasons from 1970 till 2008.

The Kemeny winner in most of the considered seasons is the same as the candidate selected by the scoring rule used by the FIA. In 2008, however, Lewis Hamilton was elected world champion (beating Felipe Massa by only one point), whereas Massa was the “Condorcet driver” and thus the first candidate in every Kemeny ranking (see Fig. 3). Since, in contrast to Kemeny’s voting system, no scoring-based rule fulfills the Condorcet property [40], such a difference in outcomes is no real surprise.

Fig. 3
figure 3

Comparing the computed Kemeny consensus ranking with the official Formula 1 ranking of season 2008

Web search.

Applying the data reduction as preprocessing helps to significantly speed up the running times of solving the Kemeny Rank Aggregation problem by the ILP-solvers (illustrated in Fig. 4). All details concerning the experimental results for this data set can be found in Tables 9–12 in Supplementary Material.

Fig. 4
figure 4

Speedup (left) and running times (right) of the web search instances (alphabetical ordering according to the search terms)

For all instances with more than 100 candidates, the results of the data reduction are displayed in Table 3: the Extended Condorcet Rule is not only able to fix candidates at the top and bottom positions but also to partition some instances into several smaller subinstances. Out of the 36 instances, 22 were solved in less than five minutes by the reduction combined with one of the other algorithms. Herein, the data reduction always contributed less than 0.5 s to the running time. For all other instances, we could still compute the “top” and the “flop” candidates of an optimal ranking. For example, for the search term “telecommuting” a subinstance with 109 candidates remains, but we know the best nine candidates (and their order). The effectiveness in terms of candidate positions fixed by the reduction rules is illustrated in Fig. 5. For example, the data reduction was able to fix more than 50 % of the candidate positions for more than 70 % of the instances.

Fig. 5
figure 5

Effectiveness and speedup for web search instances with more than 50 candidates. Left we use two different measures of effectiveness: the percentage of candidate positions that could be determined by the data reduction (crosses) and the percentage of conflict pairs that could be resolved by the data reduction (boxes). The position on the x-axis gives the number of candidates and the position on the y-axis gives the value of the corresponding measure. Right a cross (box) at position \((x,y)\) means that at least \(x\) % of the candidate positions (conflict pairs) could be determined by the data reduction for at least \(y\) % of the instances. For example, this means that for 80 % of the instances with more than 50 candidates more than 65 % of the conflict pairs could be resolved by the data reduction. Furthermore, 100 % of the candidate positions (that is, the instance was solved) could be determined for more than 25 % of the web search instances with more than 50 candidates

Table 3 Web data instances with more than 100 candidates

Web impact.

As for the capitals, our algorithms (the data reduction and the ILP-solvers) computed a Kemeny ranking in less than 2 s. The data reduction performed very well; the remaining subinstances correspond to the profile

$$\begin{aligned} 1^{7}\!>{5}>1^{10}>{9}>1^{14}>{34}>1^{5}>{6}>1^{3}>{6}>1^{7}>{9}\!>1^{36}>{11}>1^{9}\!>{43}>1^{26}. \end{aligned}$$

Herein, \(1^{x}\) denotes \(x\) consecutive subinstances consisting of a single candidate each. Thus, only two harder subinstances of sizes 34 and 43 remained for the ILP-solvers. The final Kemeny ranking starts as follows: London \(>\) Paris \(>\) Madrid \(>\) Singapore \(>\) Berlin \(> \ldots \) For aggregating the nation rankings, our algorithms were less successful. However, we could still compute the top 6 and the flop 14 candidates with the data reduction. Surprisingly, the best-represented nation on the web seems to be Indonesia, followed by France, the United States, Canada, and Australia. The instance consisting of the 103 richest persons was solved exactly in milliseconds by applying the data reduction alone. Details concerning the experimental results of this data set can be found in Tables 13–16 in the Supplementary Material. Speedup and running times are illustrated in Fig. 6.

Fig. 6 Speedup (left) and running times (right) of the web impact instances (arbitrary ordering)

Mallows model.

Confirming our findings for the real-world data sets, the data reduction was also successful on this synthetic data set. For example, for instances with 100 candidates and 10 rankings, the data reduction broke the instances into at least 26 subinstances with a maximum of 36 candidates on average whenever the dispersion parameter \(\theta \) was \(0.1\) or higher.

Interestingly, the data reduction did not only determine some top and flop candidates; usually, it also fixed the exact positions of candidates in the middle of the final ranking. Examples can be found in Table 4.

Table 4 Typical structure of the reduced subinstances computed from the Mallows model instances

As expected, our test series also showed a correlation between the dispersion parameter \(\theta \) and the running time for solving Kemeny Rank Aggregation. When the number of candidates and rankings is fixed, the running time is highest when \(\theta \) is close to zero (that is, all possible rankings are almost equally likely) and lowest when \(\theta \) is close to one (that is, every ranking differing from the central ranking is very unlikely). Overall, the data reduction was quite successful for most test instances: For 100 candidates and 10 rankings, the data reduction enabled a speedup for \(\theta \ge 0.01\). The speedup factor for CPLEX is more than \(10\) for \(\theta \ge 0.2\). Similar effects can be observed for 4, 5, and 20 rankings. The speedup factor for Gurobi is slightly higher for most instances.
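The paper does not spell out its exact parameterization of the dispersion \(\theta \). The sketch below uses the repeated insertion method for the common parameterization \(P(r) \propto \exp (-\theta \cdot {{\mathrm{KT-distance}}}(r, r_0))\) around a central ranking \(r_0\), which matches the qualitative behavior described above (\(\theta \) near zero: almost uniform; large \(\theta \): concentrated on \(r_0\)); the function name and signature are illustrative, not from the paper's experimental setup.

```python
import math
import random

def sample_mallows(center, theta, rng=random):
    """Draw one ranking from a Mallows distribution around `center` using
    the repeated insertion method.  Assumes P(r) proportional to
    exp(-theta * KT(r, center)): theta = 0 yields the uniform distribution
    over rankings, large theta concentrates on `center`."""
    phi = math.exp(-theta)
    ranking = []
    for i, cand in enumerate(center):
        # Inserting the i-th candidate of `center` at position j creates
        # i - j inversions w.r.t. `center`, so position j gets weight
        # phi ** (i - j); position i (keeping the center order) has weight 1.
        weights = [phi ** (i - j) for j in range(i + 1)]
        r = rng.random() * sum(weights)
        for j, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        ranking.insert(j, cand)
    return ranking
```

With this parameterization, a very large \(\theta \) essentially always reproduces the central ranking, while \(\theta = 0\) returns a uniformly random permutation.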

The observed effects become stronger with increasing \(\theta \) as well as with a decreasing number of candidates and rankings, and correspondingly weaker in the opposite directions. We illustrate speedup and running times in Figs. 7 and 8.

Fig. 7 Speedup (left) and running times (right) of the Mallows model instances. Upper diagrams: instances with 50 candidates and 4 votes; lower diagrams: instances with 50 candidates and 20 votes

Fig. 8 Speedup (left) and running times (right) of the Mallows model instances. Upper diagrams: instances with 100 candidates and 10 votes; lower diagrams: instances with 200 candidates and 4 votes

However, even for 200 candidates, 20 rankings, and \(\theta \ge 0.1\), the data reduction yielded average speedup factors between \(14\) and \(20\) for CPLEX and between \(22\) and \(59\) for Gurobi. A detailed collection of our results for these instances can be found in the Supplementary Material.

Plackett–Luce model.

In contrast to the Mallows model, the data reduction was less successful for instances sampled from the Plackett–Luce model. Especially for instances with 80 and more candidates, the data reduction could often only determine some top and flop candidates. It could not break up the Plackett–Luce model instances into many subinstances as it did for the Mallows model instances. Examples can be found in Table 5.

However, computing some top and flop candidates also leads to some speedup. For example, consider the instances with 70 candidates, 5 votes, and exponentially decreasing weight vectors. The average speedup from computing some top and flop candidates was 1.6.

For polynomially decreasing weight vectors, the speedups tend to become stronger with increasing polynomial degree, as well as with a decreasing number of candidates and rankings. A detailed collection of our results for these instances can be found in the Supplementary Material.
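For reference, a Plackett–Luce sample is drawn by repeatedly selecting the next candidate among the remaining ones with probability proportional to its weight; the weight vector (for example, exponentially or polynomially decreasing) is the model parameter varied in the test series. A minimal sketch, with an illustrative function name and weights of our choosing:

```python
import random

def sample_plackett_luce(weights, rng=random):
    """Draw one ranking from a Plackett-Luce model: repeatedly select the
    next candidate among the remaining ones with probability proportional
    to its weight.  `weights` maps candidate -> positive weight."""
    remaining = list(weights)
    ranking = []
    while remaining:
        r = rng.random() * sum(weights[c] for c in remaining)
        for c in remaining:
            r -= weights[c]
            if r <= 0:
                break
        remaining.remove(c)
        ranking.append(c)
    return ranking
```

Flatter weight vectors make the pairwise majorities closer to ties, which is consistent with the data reduction being less effective here than on concentrated Mallows instances.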

Table 5 Typical structure of the reduced subinstances computed from Plackett–Luce model instances

4 Conclusion

Our experiments showed that the Extended Condorcet Rule allows for the computation of optimal Kemeny rankings for real-world instances of non-trivial size within seconds. We found exact solutions for large real-world instances with more than 100 candidates by means of data reduction within milliseconds, while the corresponding instances could not be solved directly within 10 s by the ILP alone (the fastest exact algorithm [13]). A key feature of the Extended Condorcet Rule is that it breaks instances into smaller, independent instances. In more general terms, algorithm engineering in combination with data reduction and kernelization for NP-hard voting problems seems to be an area with several promising challenges [5].

On the theoretical side, we improved the previous partial problem kernel [7] with respect to the parameter average KT-distance from quadratic to linear size. It remains open whether there is a linear partial kernel with respect to the \(\ge _s\)-majority for further values of \(s\), which would potentially give data reduction rules that are not subsumed by the Extended Condorcet Rule. A natural step towards answering this question is to investigate whether for two non-dirty candidates \(a,b\) there must be a Kemeny ranking with \(a > b\) whenever \(a \ge _s b\).

An important extension of Kemeny Rank Aggregation is to consider “constraint rankings”, that is, the problem input may additionally contain a prespecified order of some candidate pairs in the consensus list [41]. Here, the considered data reduction cannot be applied anymore. Extending the data reduction framework to rankings with ties or partial rankings would be of high interest as well.

Kemeny Rank Aggregation with Partial Rankings seems to be the most interesting extension. Unfortunately, considering majorities (as the \(\ge _s\)-majority rules and the Extended Condorcet Rule do) seems not to be promising for practically important instances where the majority of the pairwise relations between the candidates is left unspecified. Furthermore, the problem becomes more difficult: It is NP-hard already for two partial rankings [9] and even if the maximum KT-distance between two input rankings is zero [4], destroying any hope for fixed-parameter tractability or (partial) problem kernels for the parameters “maximum KT-distance” and “average KT-distance”. Hence, a different approach and different parameters seem necessary to obtain any provable guarantees for the effect of polynomial-time preprocessing.

In contrast to Kemeny Rank Aggregation with Partial Rankings, for Kemeny Rank Aggregation with Ties a quadratic partial kernel with respect to the parameter “average KT-distance” is known [7]. It looks promising to investigate whether this can also be improved to a linear partial kernel.