1 Introduction

One-third of all sports registered with the International Olympic Committee rely on judges’ evaluations of the athletes’ performance (e.g., gymnastics, diving, skating, boxing or dressage, among others). These judgments are made by qualified, but potentially biased, judges. For instance, judges’ decisions in sports can be biased by friendships, personal interests, social/crowd pressure, lobbies and the exchange of favors, or even by the race, gender, nationality or reputation of the athlete (Sect. 2 discusses the different biases in sports in greater detail). In this context, it is difficult to infer whether a particular judgment is biased, because evaluations of qualitative performance are complex and inherently open to subjectivity and manipulation. At the same time, the technical and artistic complexity and subjectivity of sports performance, together with aspects such as time pressure and cognitive load, make performance evaluation a difficult task (Plessner and Haar 2006).

Performance evaluation depends on the alignment of material incentives (Baker 1992; Dohmen and Sauermann 2016), but also on the social environment and on cognitive biases, i.e., perceptual distortions and inaccurate or illogical interpretations (Asch 1951; Baron 2007; Deutsch and Gerard 1955; Kahneman and Tversky 1972). These biases arise from various processes that operate simultaneously in the judges’ minds, such as heuristics (Shah and Oppenheimer 2008; Tversky and Kahneman 1974), noise and limited information processing capacity (Hilbert 2012; Simon 1955), emotional and moral motivations (Pfister and Böhm 2008) and social influence (Wang et al. 2001). According to Tversky and Kahneman (1974) and Kahneman and Tversky (1996), biases can also serve as shortcut strategies for processing complex information.

In order to deal with these difficulties, the International Olympic Committee, together with the majority of the recognized international federations, establishes that the final score is the arithmetic mean of the scores of all the judges on the panel, in some cases after the most extreme scores (i.e., the highest and the lowest) have been removed from the calculation. However, this commonly used procedure may not be effective, since most forms of bias are subtle and can be concealed strategically (Bassett Jr and Persky 1994; Osório 2017; Plessner and Haar 2006; Wu and Yang 2004).

In this context, we are tempted to think that a score closer to the mean is less likely to be biased. However, there is a crucial aspect to take into consideration: such a score may not be compatible with the grading style of that particular judge. For instance, if the judge in question is known to be particularly strict, i.e., a judge who usually awards scores well below the panel mean, a score closer to the mean may actually carry bias. In other words, the judge might be strategically hiding bias by grading closer to the panel’s mean and deviating from his/her own style. For that reason, any score aggregation procedure must take into consideration each judge’s grading style and any deviations from it. The argument can be reversed, in the sense that a score well above (or below) the mean may not necessarily suggest the existence of bias, since it may actually be compatible with that particular judge’s grading style.

In our context, a judge’s grading style is a measurement based on the judge’s history of past scores relative to the scores awarded by the panels in which the judge has participated. If a judge has a history of consistently grading above the panel mean, then this judge might be considered as being more lenient than other judges. On the other hand, if a judge has a history of consistently grading below the panel mean, then this judge might be considered as being stricter than other judges. In this context, Looney (2004) points out that sport governing bodies can improve the methods of performance evaluation by considering aggregation procedures that are able to capture the grading consistency of each judge.

The objective of this paper is to propose a bias correction procedure that aggregates the grades of all the judges on the panel, and that can simultaneously control for deviations from the judges’ grading styles. There are several reasons to justify the introduction of this type of operational solution in sports performance evaluation (Bassett Jr and Persky 1994; Osório 2017; Wu and Yang 2004), and in other dimensions of our lives (Balinski and Laraki 2007, 2010; Beliakov et al. 2007; Grabisch et al. 2011a, b). First, the functioning of our society as a whole—not only sports performance evaluation—frequently relies on the ranking of objects, places, performances, projects, ideas, policies, issues, etc. In this context, the development of better evaluation procedures is of first-order relevance for numerous scientific, academic and professional fields. Second, strategic bias is difficult to identify by third-party monitoring. Spectators, sport governing bodies and even experts fail to detect bias and are not aware of subtle aspects like each judge’s grading style. In most cases, information about each judge’s grading style is not even available. Third, from a cognitive perspective, the consideration of such strategic aspects is difficult because it requires the processing of large amounts of information. Fourth, nowadays the vast majority of scoring systems are completely automated, which simplifies matters and invites the use of more complex algorithms that can help reduce and mitigate the potential effects of bias. For instance, Díaz-Pereira et al. (2014) suggest the use of human motion recognition and artificial intelligence technologies in order to reduce bias and assist judges in the decision-making process (see Cust et al. (2019) for a review of this literature).

In line with the previous discussion, the score aggregation procedure proposed in this paper is designed to simultaneously penalize score deviations from the judges’ grading styles and score deviations from the panel mean. These deviations are the ones that are most likely to be biased. The argument is that if a judge favors or penalizes a particular candidate, then this judge must be grading differently from his/her grading style and/or differently from the other judges on the same panel. Consequently, such a score should receive less weight than the scores of the other judges who are more consistent with their grading styles and with the panel mean, and vice versa. In this context, it is the information contained in the grades of the other judges and in their grading history that determines the relevance given to each score.

Subsequently, we show that the proposed score aggregation procedure satisfies a set of desirable properties, and we consider its application to a unique data set from the 2000 Summer Olympic Games diving competitions in order to see how it reacts to the possibility of bias. We found that the implied corrections are not large enough to unequivocally support changes to the medal standings as suggested by Emerson et al. (2009). Nonetheless, the results obtained do not contradict Emerson et al. (2009). The differences are justified by the fact that the proposed score aggregation procedure corrects for the effect of deviations from the panel mean and the judges’ grading style, but not so much for the influence of other forms of bias, such as for example, nationalistic bias.

To summarize, the contribution of this paper has three main aspects that distinguish it from the existing literature (see the literature review below). First, the proposed score aggregation procedure does not intend to detect and analyze bias ex-post (i.e., after the final score is released), but to reduce and mitigate the effect of bias ex-ante (i.e., before the final score is released). Second, the proposed score aggregation procedure controls for deviations from the panel mean and/or from the judges’ grading style. The consideration of deviations from the judges’ grading style is new in the literature. Third, the proposed score aggregation procedure is not specific to a particular type of bias, but addresses bias in general, which makes it a useful tool for academics, practitioners and professionals in applied work. However, we must be careful, in the sense that it does not capture or remove all the existing bias and all the different types of bias. For instance, the proposed aggregation procedure has some limitations when it comes to dealing with bias that affects all or the majority of the judges, or bias towards the mean instead of away from it, as in a Keynesian beauty contest (Keynes 1936), and it is not designed to address a particular and specific form of bias (e.g., nationalistic bias), which must be treated individually.

This paper is organized as follows: Sect. 2 provides a brief review of the literature. Section 3 presents the score aggregation procedure, Sect. 4 states and discusses a set of desirable properties, Sect. 5 provides an illustrative application to the 2000 Olympic Games diving competition, and Sect. 6 presents the conclusions.

2 Literature Review

This section reviews (i) the literature on bias in sports performance evaluation, with a brief reference to other cases of performance evaluation, and (ii) the literature on preference and judgment aggregation.

In order for judges to act in accordance with the interests of the associated competition organizing body, the material incentives should be aligned (Baker 1992; Dohmen and Sauermann 2016). As in a principal-agent relationship, unbiased judgments should be rewarded and biased judgments should be punished. In this context, bribes, friendships, personal interests and lobbies distort incentives and consequently induce biased decisions (Duggan and Levitt 2002; Wolfers 2006).

However, individual decisions also depend on non-material aspects associated with the social environment and on cognitive biases, e.g., perceptual distortions, and inaccurate or illogical interpretations (Asch 1951; Baron 2007; Deutsch and Gerard 1955; Kahneman and Tversky 1972). These biases arise from various processes that operate in the judges’ minds and that are difficult to separate from each other—for instance, heuristics (Shah and Oppenheimer 2008; Tversky and Kahneman 1974), noise and limited information processing capacity (Hilbert 2012; Simon 1955), emotional and moral motivations (Pfister and Böhm 2008), and social influence (Wang et al. 2001).

The list of cognitive biases reported over the last decades is continuously evolving (Baron 2007). In this context, the complexity of sports performance evaluation together with aspects like time pressures, cognitive load, and performance subjectivity makes this subject very active in terms of research. The following review of the literature offers a brief summary of some of this research. For more exhaustive reviews of the literature, see Bar-Eli et al. (2011), Dohmen and Sauermann (2016) and Plessner and Haar (2006).

Nationalistic bias is a particular type of bias that has been frequently reported in the sports performance evaluation literature. For instance, Coupe et al. (2018) studied bias in the FIFA Ballon d’Or award for the best soccer player. They found that judges are biased towards candidates from their own country, national team, continent and league team. Popović (2000) examined the rhythmic gymnastics competition in the 2000 Summer Olympics and found that judges tend to favor their own country’s gymnasts, although the effect was not statistically significant. Similarly, Zitzewitz (2006) examined the figure skating and ski jumping competitions in the 2002 Winter Olympics and found evidence in favor of nationalistic bias (see also Lock and Lock 2003; Zitzewitz 2014). Emerson et al. (2009) examined the diving competition in the 2000 Summer Olympics and concluded that nationalistic bias could have influenced the final medal standings.

Similarly, using data from the Eastern Ontario and Québec sections of Skate Canada, Findlay and Ste-Marie (2004) found reputation bias in figure skating. Skaters received better ranks when they were evaluated by judges who knew their reputation than when they were evaluated by judges who did not.

In gymnastics, within-team order bias is particularly common. In this case, biased expectations are induced by the common strategy used by coaches of placing their strongest gymnasts later in the order of rotation. Plessner and Haar (2006) found that this strategy induces judges to give higher marks to performances at the end of the rotation order than if that same performance had been observed earlier in the rotation order. In the same way, Damisch et al. (2006) found that sequential performance judgments in sports are biased by the previously judged performances, which depends on the degree of perceived similarity between the successive performances.

Using data from the World Figure Skating Championships between 2001 and 2003, Lee (2008) shows the existence of outlier aversion bias, in which judges avoid grading far from the panel mean, as in a beauty contest (Keynes 1936).

The home team advantage is another well-studied form of bias, observed in many sports such as football, basketball, baseball and ice hockey, and is often explained by the crowd’s influence on judges’ decisions (Dohmen and Sauermann 2016; Garicano et al. 2005; Nevill et al. 1996; Price et al. 2012; Sutter and Kocher 2004; Unkelbach and Memmert 2010). In the same vein, Page and Page (2007) found that teams have a higher chance of qualifying for the next round when they play the second leg at home.

Racial bias in sports—which is frequent in other dimensions of our lives—has been found among National Basketball Association referees (Price and Wolfers 2010; Larsen et al. 2008), and among Major League Baseball umpires (Parsons et al. 2011).

Other, less common forms of bias have also been reported in the sports literature (Dohmen and Sauermann 2016). For instance, Helsen et al. (2006) found cognitive and perceptual biases in offside calls, which depend crucially on the position of the referee relative to the players. Frank and Gilovich (1988) found that shirt color can induce cognitive biases amongst football and ice hockey players.

In this paper, we focus on sports performance evaluation, but the number of situations that require performance evaluations, and that are affected by different sources of bias, is endless. Bias is not merely an issue prevalent in subjective evaluations, but inherent to every dimension of life (Buchanan et al. 1998). The score aggregation procedure in this paper attempts to mitigate the effect of bias in performance evaluation.

In addition to the limitations associated with subjective judgments, there are also difficulties at the aggregation stage. A large body of literature on sports performance evaluation (Bassett Jr and Persky 1994; Osório 2017; Wu and Yang 2004), and on judgment in general (Balinski and Laraki 2007, 2010, 2014; Felsenthal and Machover 2008), has proposed different solutions to aggregate the preferences of different individuals (Beliakov et al. 2007; Grabisch et al. 2011a, b). For instance, Osório (2017) proposes an aggregation procedure that corrects deviations from the panel mean, while this paper goes a step further by proposing an aggregation procedure that can also correct deviations from each judge’s grading style.

The most common solution, among the International Olympic Committee and the international federations, is “range voting”, in which judges rate the candidates with a grade within a specified interval. The candidate with the highest sum or average wins. The method is easy to implement and passes certain generalizations of the Arrow (1950) impossibility theorem, but it is particularly sensitive to bias and strategic manipulation.

Often, in order to deal with this difficulty, truncation is used to remove extreme scores and mitigate potential bias. In this context, “majority judgment” ranks candidates by the median score, i.e., all scores are truncated except the middle one, which becomes the final score (Balinski and Laraki 2007, 2010). This procedure is more robust to manipulation and reduces the incentives to exaggerate. However, excessive truncation leads to a loss of information and diversity, in particular if bias is only a possibility. The score aggregation procedure proposed in this paper preserves the information and diversity of opinions contained in the judges’ panel while mitigating the effects of bias.

3 The Score Aggregation Procedure

In general, there is no evidence to prove conclusively whether a particular score is biased. Moreover, it is virtually impossible to control for all forms of conscious and unconscious bias and manipulation. Another difficulty is that the judges’ preferences and interests are private information and impossible to determine ex-ante. In this context, any score aggregation procedure must depend only on what is known, which in many cases is not much. In what follows, we propose a score aggregation procedure that attempts to deal with these practical limitations and to mitigate the effect of bias.

In this context, we control for bias in two dimensions. The first dimension controls for score deviations from each judge’s historical grading style: each judge has a unique grading style, and some judges are systematically stricter or more lenient than others. The second dimension controls for score deviations from the panel’s mean score.

Let \(s_{ij}\in \left[ S_{-},S_{+}\right] \subset {\mathbb {R}} \) be the score awarded by judge \(j\in J=\left\{ 1,...,n\right\} \) for the performance of competitor \(i\in I=\left\{ 1,...,m\right\} .\) We consider well-defined scores on numerical scales, with no language-consistency issues among the judges, e.g., \(\left[ S_{-},S_{+}\right] =\left[ 0,10\right] .\) Let \({\mathbf {s}}_{i}=(s_{i1},...,s_{in})\) denote the vector of scores awarded by the panel of judges for the performance of competitor \(i\in I.\) The mean score of the performance of the competitor \(i\in I\) is denoted as \( {\overline{s}}_{i}\) and corresponds to the arithmetic mean over the scores awarded by all judges, i.e., \({\overline{s}}_{i}\equiv \frac{1}{n} \sum \nolimits _{j=1}^{n}s_{ij}.\)

In addition, in order to determine each judge’s grading style, we consider the history of past scores. Let the history of the past scores awarded by judge \(j\in J\) be denoted as \({\mathbf {h}}_{j}^{t},\) and let the history of past mean scores awarded on the panels on which judge \(j\in J\) participates be denoted as \({\mathbf {h}}_{(j)}^{t},\) where the superscript t denotes the moment in time. The vectors \({\mathbf {h}}_{j}^{t}\) and \({\mathbf {h}}_{(j)}^{t}\) consist of the past scores that are considered relevant in defining judge \(j\)’s style, i.e., \({\mathbf {h}}_{j}^{t}=(s_{.j}^{t-1},s_{.j}^{t-2},...)\) and \({\mathbf {h}}_{(j)}^{t}=({\overline{s}}_{.(j)}^{t-1},{\overline{s}}_{.(j)}^{t-2},...),\) respectively, where the subscript “.” expresses the irrelevance of the competitor’s identity associated with that history of past scores. For example, these vectors may consist of all the scores awarded over the last year, or all the scores awarded up to the present event, or any other criterion. In order to keep the notation as simple as possible, when possible, we remove the explicit reference to time t.

In this context, in order to determine each judge’s grading style, one possibility is to aggregate the history of past scores into a single measure (e.g., a simple average, a weighted average, or any other stable criterion). Let \({\overline{s}}_{ {\mathbf {h}}_{j}}\equiv \frac{1}{T}\sum \nolimits _{t=1}^{T}s_{.j}^{t}\) be the arithmetic mean of judge \(j\in J\)’s history of past scores (where T is the number of scores considered for the history of judge \(j\in J\)), and \({\overline{s}}_{{\mathbf {h}}_{(j)}}\equiv \frac{1}{T}\sum \nolimits _{t=1}^{T} {\overline{s}}_{.(j)}^{t}\) be the arithmetic mean of the history of past mean scores awarded by the panels on which judge \(j\in J\) has been involved (i.e., the mean of the history of panel means). In our context, given the performance of competitor i and the panel mean \( {\overline{s}}_{i},\) the style adjusted expected grade of judge \(j\in J\) for competitor i, i.e., the grade of judge j that would be compatible with his/her own style, is defined as follows.
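As an illustration, these two history aggregates reduce to simple arithmetic means. The sketch below uses a hypothetical four-event history (the numbers are ours, chosen to match judge 1 of Example 1 below), not data from the paper:

```python
# Sketch: aggregating a judge's history into the two means defined above.
# The history values are hypothetical illustrations.
past_scores = [8.0, 7.5, 8.5, 8.0]        # h_j: judge j's own past scores
past_panel_means = [7.0, 7.0, 7.5, 6.5]   # h_(j): means of the panels judge j sat on

s_bar_hj = sum(past_scores) / len(past_scores)                # mean past score = 8.0
s_bar_hpj = sum(past_panel_means) / len(past_panel_means)     # mean past panel mean = 7.0

# A ratio above 1 marks a lenient judge; below 1, a strict one (see Definition 1).
print(s_bar_hj / s_bar_hpj)  # 8/7 ≈ 1.143: this judge grades above the panel mean
```
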

Definition 1

The style adjusted expected grade of judge j is defined as

$$\begin{aligned} {\overline{s}}_{i{\mathbf {h}}_{j}}={\overline{s}}_{i}{\overline{s}}_{{\mathbf {h}}_{j}}/ {\overline{s}}_{{\mathbf {h}}_{(j)}}, \end{aligned}$$

where the ratio \({\overline{s}}_{{\mathbf {h}}_{j}}/{\overline{s}}_{{\mathbf {h}} _{(j)}}\) defines the grading style of judge j.

Therefore, the ratio \({\overline{s}}_{{\mathbf {h}}_{j}}/{\overline{s}}_{{\mathbf {h}} _{(j)}}\) captures judge \(j\)’s history of deviations from the panel mean, and defines judge \(j\)’s grading style.

The following example provides an illustration.

Example 1

Suppose judges 1, 2 and 3 awarded the scores \(s_{i1}=7.00\), \(s_{i2}=7.00\) and \(s_{i3}=8.00,\) respectively, so that \({\overline{s}}_{i}=7.33.\) Suppose also that their history is summarized by the mean of their past scores, i.e., \({\overline{s}}_{{\mathbf {h}}_{1}}=8.00\), \({\overline{s}}_{ {\mathbf {h}}_{2}}=7.00\) and \({\overline{s}}_{{\mathbf {h}}_{3}}=6.00,\) respectively, and by the mean of the past means awarded by the panels on which these judges have been involved, i.e., \({\overline{s}}_{{\mathbf {h}} _{(1)}}=7.00\), \({\overline{s}}_{{\mathbf {h}}_{(2)}}=7.00\) and \({\overline{s}}_{ {\mathbf {h}}_{(3)}}=7.00,\) respectively. Then, each judge’s style adjusted expected grade would be given by \({\overline{s}}_{i{\mathbf {h}}_{1}}=8.38,\) \( {\overline{s}}_{i{\mathbf {h}}_{2}}=7.33\) and \({\overline{s}}_{i{\mathbf {h}} _{3}}=6.29,\) respectively.

This example suggests that although judges 1 and 2 seem to agree on a final score of 7.00, judge 3, by proposing a score of 8.00, might be deviating from his/her grading style. Note that judge 3 has a history of being strict, awarding on average \({\overline{s}}_{{\mathbf {h}}_{3}}=6.00\) on panels that awarded on average \({\overline{s}}_{{\mathbf {h}}_{(3)}}=7.00.\) In this context, in order to be consistent with his/her grading style and the scores of the other judges, judge 3 should have proposed a score somewhere near 6.86. The score aggregation procedure proposed in this paper has the objective of reducing the influence of diverging scores like the score awarded by judge 3. However, we must be careful, because bias is only a possibility.

In the context of the present paper, this objective is achieved by reducing the weight given to divergent scores. For that reason, the score aggregation procedure proposed in this paper gives weights of \( 33.8\%\) and \(47.5\%\) to the scores of judges 1 and 2, respectively, and a weight of only \(18.7\%\) to the score of judge 3 (for \( \alpha =1/2\) and \(\gamma =2,\) see below).

Formally, the weights are functions \(w_{ij}:D_{1}\times ...\times D_{n}\rightarrow \left[ 0,1\right] ,\) where \(D_{k}=\left[ S_{-},S_{+}\right] ^{1+|{\mathbf {h}}_{k}|+|{\mathbf {h}}_{(k)}|},\) \(|{\mathbf {h}} _{k}|\) denotes the cardinality of judge \(k\)’s history of past scores, and \(|{\mathbf {h}}_{(k)}|\) denotes the cardinality of the history of past mean scores awarded on the panels in which judge k participated (note that we may have \(|{\mathbf {h}}_{k}|=|{\mathbf {h}}_{(k)}|\)). In other words, the weights depend on the history of past scores of each judge \(j\in J,\) i.e., \( \{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n}\), which defines judge \(j\)’s grading style \({\overline{s}}_{i{\mathbf {h}}_{j}},\) and on the scores awarded by all the judges, i.e., the vector \({\mathbf {s}}_{i}.\)

Definition 2

The weights are defined as:

$$\begin{aligned} w_{ij}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})\equiv \frac{\sum \nolimits _{k\ne j}^{n}(\alpha \left| s_{ik}-{\overline{s}}_{i {\mathbf {h}}_{k}}\right| +(1-\alpha )\left| s_{ik}-{\overline{s}} _{i}\right| )^{\gamma }}{(n-1)\sum \nolimits _{k=1}^{n}(\alpha \left| s_{ik}-{\overline{s}}_{i{\mathbf {h}}_{k}}\right| +(1-\alpha )\left| s_{ik}-{\overline{s}}_{i}\right| )^{\gamma }}, \end{aligned}$$
(1)

for all \(j\in J.\)

Consequently, given the performance of competitor \(i\in I,\) the vector of scores awarded by the n judges are aggregated into a single score, according to the following definition.

Definition 3

The score aggregation procedure is defined as:

$$\begin{aligned} {\overline{s}}_{i}^{*}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\equiv \sum \nolimits _{j=1}^{n}w_{ij}s_{ij}, \end{aligned}$$
(2)

where \(w_{ij}\) represents the weight given to the score \(s_{ij}\) awarded by judge \(j\in J\) for the performance of competitor \(i\in I,\) with \( \sum \nolimits _{j=1}^{n}w_{ij}=1.\)

In case of a tie between two or more competitors, the reader is free to consider any tie-breaking rule.

Definitions 1, 2 and 3 fully describe the score aggregation procedure proposed in this paper.
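For concreteness, Definitions 1–3 can be collected into a short numerical sketch (the function and variable names below are ours, purely for illustration). Run on the data of Example 1 with \(\alpha =1/2\) and \(\gamma =2\), it reproduces the weights quoted there:

```python
def aggregate(scores, styles, alpha=0.5, gamma=2.0):
    """Illustrative implementation of Definitions 1-3.

    scores[j]: score s_ij awarded by judge j.
    styles[j]: judge j's grading-style ratio (mean past score over
               mean past panel mean, as in Definition 1).
    Returns the weights of Expression (1) and the final score of Expression (2)."""
    n = len(scores)
    mean = sum(scores) / n                      # panel mean
    expected = [mean * r for r in styles]       # style adjusted expected grades
    dev = [(alpha * abs(scores[j] - expected[j])
            + (1 - alpha) * abs(scores[j] - mean)) ** gamma for j in range(n)]
    total = sum(dev)
    weights = [(total - dev[j]) / ((n - 1) * total) for j in range(n)]
    final = sum(w * s for w, s in zip(weights, scores))
    return weights, final

# Example 1: scores (7.00, 7.00, 8.00); style ratios 8/7, 7/7, 6/7.
w, s_star = aggregate([7.0, 7.0, 8.0], [8 / 7, 1.0, 6 / 7])
print([round(100 * x, 1) for x in w])  # [33.8, 47.5, 18.7], as in Example 1
print(round(s_star, 2))                # 7.19, below the arithmetic mean of 7.33
```

Note that if every judge graded exactly at his/her style adjusted expected grade and at the panel mean, the denominator of Expression (1) would vanish; an operational implementation would need to fall back to equal weights \(1/n\) in that degenerate case.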

The parameters in Definition 2 have the following interpretation. The parameter \(\alpha \in \left[ 0,1\right] \) controls the importance given to score deviations from the judges’ grading styles, and \(1-\alpha \) controls the importance given to score deviations from the panel mean. In the particular case where \(\alpha =1,\) only the deviations from the grading style are punished, while in the particular case where \(\alpha =0,\) only the deviations from the panel mean are punished. However, since we are interested in punishing both types of deviations simultaneously, we set \( \alpha \in (0,1).\)

The parameter \(\gamma \ge 0\) determines the magnitude of the punishment of score deviations. The larger the value of \(\gamma ,\) the stronger the punishment of scores that are distant from the judges’ grading styles and from the panel mean. However, values of \(\gamma \) that are too large can be problematic, because bias is only a possibility, and we do not want to distort the results in cases in which there is no bias. On the other hand, low values of \(\gamma \) may not penalize biased scores enough.

In our context, Expression (1) is written in its most general form, and the parameters \(\alpha \) and \(\gamma \) are controlled by the social planner or the sport’s governing body. However, in applied and operational work, in order to simplify the analysis, we can consider \(\alpha =1/2\) (i.e., equal importance to both types of deviations) and \(\gamma =2\) (i.e., the quadratic distance). The proposed score aggregation procedure is particularly flexible when it comes to accommodating different possibilities.
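The role of \(\gamma \) can be checked numerically on the data of Example 1: \(\gamma =0\) weights every judge equally (recovering the arithmetic mean), while larger values punish divergent scores more heavily. A brief sketch (the helper function is our illustrative rendering of Expression (1)):

```python
def weights(scores, styles, alpha=0.5, gamma=2.0):
    # Weights of Expression (1); styles[j] is judge j's grading-style ratio.
    n = len(scores)
    mean = sum(scores) / n
    dev = [(alpha * abs(scores[j] - mean * styles[j])
            + (1 - alpha) * abs(scores[j] - mean)) ** gamma for j in range(n)]
    total = sum(dev)
    return [(total - dev[j]) / ((n - 1) * total) for j in range(n)]

scores, styles = [7.0, 7.0, 8.0], [8 / 7, 1.0, 6 / 7]   # Example 1 data
print(weights(scores, styles, gamma=0.0))  # [1/3, 1/3, 1/3]: the plain mean
print(weights(scores, styles, gamma=2.0))  # judge 3's weight ≈ 0.187
print(weights(scores, styles, gamma=4.0))  # judge 3's weight drops to ≈ 0.108
```
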

The weight given to judge j in Expression (1) increases with the deviations of the other judges k from their grading styles \(\left| s_{ik}-{\overline{s}}_{i{\mathbf {h}}_{k}}\right| \) and the panel mean \(\left| s_{ik}-{\overline{s}}_{i}\right| ,\) and decreases with the deviations from judge \(j\)’s own grading style \( \left| s_{ij}-{\overline{s}}_{i{\mathbf {h}}_{j}}\right| \) and the panel mean \(\left| s_{ij}-{\overline{s}}_{i}\right| .\) Expression (1) also meets the objective of penalizing most heavily the largest score deviations from the judges’ grading styles and from the panel mean, which are the scores that are most likely to be biased. Simultaneously, the correction mechanism gives more weight to judges that are presumed not to be biased, i.e., the ones whose scores show greater prevalence and similarity, and are more consistent with their own grading styles. This idea motivates the score aggregation procedure in this paper.

Intuitively, on the one hand, the first term in the numerator of Expression (1) considers score deviations from the grading style of all the judges other than judge \(j\in J,\) i.e., all \(k\ne j\in J\). On the other hand, the second term in the numerator of Expression (1) considers score deviations from the panel mean of all the judges other than judge \(j\in J,\) i.e., all \(k\ne j\in J.\) Therefore, the more (respectively, the less) judge \(j\in J\) deviates from his/her grading style and from the panel mean, the more (respectively, the less) weight the scores of the other judges \(k\ne j\in J\) receive, and the less (respectively, the more) weight his/her own scores receive.

In this context, in order to understand the intuition behind the proposed aggregation method, consider the following decomposition of judge \(j\)’s evaluation of competitor \(i\)’s performance:

$$\begin{aligned} s_{ij}=a_{i}+u_{ij}+b_{ij}, \end{aligned}$$

where \(a_{i}\) is the unknown actual evaluation of competitor \(i\)’s performance, \(u_{ij}\) is the unbiased deviation of judge j, which is i.i.d. for each judge according to some distribution, and \(b_{ij}\) is the subjective bias of judge j towards competitor i, which can also be seen as a random variable. Therefore, \(u_{ij}\) captures judge \(j\)’s grading style, while \(b_{ij}\) captures judge \(j\)’s subjective bias, where we are assuming that \(u_{ij}\) and \(b_{ij}\) are independent and that the proposed additive decomposition exists. In this context, judge \(j\)’s grading style component of \(s_{ij},\) in the absence of bias, is equal to \({\overline{s}}_{i{\mathbf {h}}_{j}}=a_{i}+u_{ij},\) while the panel mean is equal to \({\overline{s}}_{i}=\sum \nolimits _{k\ne j}^{n}(a_{i}+u_{ik}+b_{ik})/n=a_{i}+\sum \nolimits _{k\ne j}^{n}u_{ik}/n+\sum \nolimits _{k\ne j}^{n}b_{ik}/n.\) Therefore, the distinctive component of judge \(j\)’s weight in Expression (1) can be written as:

$$\begin{aligned}&\alpha \left| s_{ij}-{\overline{s}}_{i{\mathbf {h}}_{j}}\right| +(1-\alpha )\left| s_{ij}-{\overline{s}}_{i}\right| =\alpha \left| b_{ij}\right| +(1-\alpha )\left| u_{ij}\right. \\&\quad \left. -\sum \nolimits _{k\ne j}^{n}u_{ik}/n+b_{ij}-\sum \nolimits _{k\ne j}^{n} b_{ik}/n\right| . \end{aligned}$$

In other words, the deviations from the grading style capture the subjective bias of judge j,  while the deviations from the panel mean capture deviations from the other judges’ average grading style and average subjective bias, respectively.

This decomposition provides an alternative intuition into how the proposed aggregation method mitigates and reduces the effects of subjective bias, i.e., either directly, by means of subjective bias itself, or indirectly, by means of deviations from the other judges’ subjective bias.

Fig. 1 Left: judges 1, 2 and 3’s weights \(w_{ij}\) for varying \(s_{i3}.\) Right: the score aggregation function \({\overline{s}}_{i}^{*}\) and the arithmetic mean \({\overline{s}}_{i}\) for varying \(s_{i3}\) (vector of scores \((7.00,7.00,s_{i3}),\) vector of mean past scores \((8.00,7.00,6.00),\) and vector of mean past panel mean scores \((7.00,7.00,7.00),\) for \(\alpha =1/2\) and \(\gamma =2\))

Figure 1, which builds on Example 1, illustrates the essence of the score aggregation procedure for the case of three judges when the score of judge 3 varies. Briefly, on the left-hand side of Fig. 1, since judge 2 grades nearer to his/her style and the panel mean than the other judges, the weight given to judge 2 always remains high. Simultaneously, the weight given to judge 3 decreases as he/she moves away from his/her grading style and the panel mean. On the other hand, the weight given to judge 1 increases as judge 3 grades above 7.00, because, in relative terms, the score of judge 1 becomes more consistent with his/her grading style and the panel mean. Consequently, the right-hand side of Fig. 1 shows the decreasing impact of judge 3’s score on the final score as the mechanism corrects the increasing deviations from his/her grading style and the panel mean.
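The qualitative pattern in Fig. 1 can be reproduced with a short numerical check on the same data and parameters as Example 1: as \(s_{i3}\) rises above 7.00, judge 3’s weight falls and judge 1’s weight rises (the helper function is our illustrative rendering of Expression (1)):

```python
def weights(scores, styles, alpha=0.5, gamma=2.0):
    # Weights of Expression (1); styles[j] is judge j's grading-style ratio.
    n = len(scores)
    mean = sum(scores) / n
    dev = [(alpha * abs(scores[j] - mean * styles[j])
            + (1 - alpha) * abs(scores[j] - mean)) ** gamma for j in range(n)]
    total = sum(dev)
    return [(total - dev[j]) / ((n - 1) * total) for j in range(n)]

styles = [8 / 7, 1.0, 6 / 7]   # mean past scores (8, 7, 6) over panel means (7, 7, 7)
for s3 in (7.0, 8.0, 9.0):
    w1, w2, w3 = weights([7.0, 7.0, s3], styles)
    print(f"s_i3 = {s3}:  w1 = {w1:.3f}  w2 = {w2:.3f}  w3 = {w3:.3f}")
# w3 falls (0.250 -> 0.187 -> 0.176) while w1 rises (0.250 -> 0.338 -> 0.365)
```
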

4 Properties of the Score Aggregation Procedure

In this section, we take a closer look at some additional properties of the proposed score aggregation procedure in Expression (2) and its weights given by Expression (1). We adapt to our context some basic properties that have been considered in the literature (Balinski and Laraki 2007; Felsenthal and Machover 2008), and which are not always easily satisfied by other aggregation procedures (Beliakov et al. 2007; Grabisch et al. 2011a, b). The analysis of these properties provides a deeper understanding of the proposed score aggregation procedure that can be useful to researchers, sport-governing bodies, decision-makers and practitioners.

The proofs of these properties are simple and therefore omitted; they follow directly from the definition of the score aggregation procedure in Expression (2) and the definition of the weights in Expression (1).

A commonly desired property is homogeneity of degree zero of the weights in the scores, which implies that the weights are independent of the units of measurement.

Property 1

(homogeneity) The weights are homogeneous of degree zero, i.e., \( w_{ij}(\lambda {\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)} \}_{k=1}^{n})=w_{ij}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\) for all \(i\in I\) and \(j\in J,\) and for any \(\lambda >0.\)

This property means that if we double the scores of all the judges, the weight given to each judge remains unchanged. The property is passed on to the score aggregation procedure, which becomes scale-consistent, i.e., homogeneous of degree one in the scores. Consequently, if we double the scores of all the judges, the final score doubles, but the ranking of the competitors remains unchanged.

Property 2

(scale-consistent) The score aggregation procedure is scale-consistent, i.e., \( {\overline{s}}_{i}^{*}(\lambda {\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})=\lambda {\overline{s}}_{i}^{*}({\mathbf {s}}_{i},\{ {\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})\) for all \(i\in I,\) and for any \(\lambda >0.\)

However, in our context, the score aggregation procedure depends on the identity of each judge, because each judge has a different grading style, which is characterized by his/her history of past scores (see the discussion in Footnote 3). This aspect is crucial in order to reduce and mitigate the possible effects of bias.

The absolute value function employed in the proposed score aggregation procedure guarantees an equal treatment of scores on both sides of the judges' grading style and the panel mean. This aspect is important because bias may be hidden above or below the judges' grading style and the panel mean. Monitoring is achieved by simultaneously considering deviations from both reference values. In this context, the proposed score aggregation procedure returns the arithmetic mean when the grades of all the judges on the panel coincide.

Property 3

(unanimity) If \(s_{ij}=s_{ik}\) for all \(j,k\in J,\) then \({\overline{s}} _{i}^{*}={\overline{s}}_{i}.\)

The property does not imply that the weights are the same, because each judge has a different grading history or style, but it does imply that if all the judges award the same grade to a given performance, then the final score must be that grade. Nonetheless, we must note that a strategic judge (strict or lenient) can hide bias even when grading in line with all the other judges. In this particular case, the aggregation procedure reflects the difficulty of building a strong argument in the event of biased behavior.

In addition, the score aggregation procedure must be independent of irrelevant alternatives. In other words, the grades awarded to competitors other than competitor \(i\in I\) cannot affect the final score of competitor \( i\in I,\) and the judges’ past scores not considered in the history cannot affect the final score of competitor \(i\in I\) (see Footnote 3).

Property 4

(independence of irrelevant alternatives) The score aggregation procedure is independent of irrelevant alternatives, i.e., \({\overline{s}}_{i}^{*}\) is independent of everything not in \({\mathbf {s}}_{i}\) and \(\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n}.\)

The score aggregation procedure should also be continuous, where continuity has the usual mathematical meaning. In other words, small changes in the numerical scores (i.e., the input) should imply small changes in the final score (i.e., the output). This property is convenient for most practical applications.

Property 5

(continuity) The score aggregation procedure \({\overline{s}}_{i}^{*}( {\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})\) is continuous in \({\mathbf {s}}_{i}.\)

Note also that the score aggregation procedure is differentiable, except when the absolute value function is not differentiable, i.e., when \(s_{ik}= {\overline{s}}_{i{\mathbf {h}}_{k}}\) or \(s_{ik}={\overline{s}}_{i}.\) Differentiability almost everywhere is also a convenient property for practical applications.

Note also that, in general, the score aggregation procedure \({\overline{s}}_{i}^{*}\) tends to be a monotonic function of \(s_{ij}.\) The exception occurs for sufficiently large score deviations from the grading style or the panel mean, and when these deviations are heavily punished (i.e., by means of a large value of \(\gamma \)). Therefore, the failure of this property occurs only under extreme circumstances and is due to the bias correction objective implicit in the score aggregation procedure. For instance, if a judge awards a score relatively higher than his/her grading style or the panel mean, the final score may fall if the decrease in the weight given to that judge outweighs the increase in the score.
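This possible failure of monotonicity can be reproduced numerically. Since Expression (1) is not shown in this section, the inverse-power weights below are a hypothetical stand-in; the point of the sketch is only that, with a large penalty exponent \(\gamma \), raising one judge's score can lower the final score, because that judge's weight collapses faster than his/her score grows.

```python
def aggregate(scores, styles, alpha=0.5, gamma=2.0):
    """Weighted mean with hypothetical inverse-power weights that
    penalize deviations from grading style and panel mean."""
    panel_mean = sum(scores) / len(scores)
    devs = [alpha * abs(s - st) + (1 - alpha) * abs(s - panel_mean)
            for s, st in zip(scores, styles)]
    weights = [1.0 / (1.0 + d ** gamma) for d in devs]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, scores))

styles = (8.00, 7.00, 6.00)
# Moderate penalty (gamma = 2): raising judge 3's score raises the final score.
print(aggregate((7.0, 7.0, 9.0), styles, gamma=2.0))   # ~7.211
print(aggregate((7.0, 7.0, 9.5), styles, gamma=2.0))   # ~7.214
# Heavy penalty (gamma = 8): judge 3's weight collapses so quickly that the
# final score falls when his/her score rises from 9.0 to 9.5.
print(aggregate((7.0, 7.0, 9.0), styles, gamma=8.0))   # ~7.0023
print(aggregate((7.0, 7.0, 9.5), styles, gamma=8.0))   # ~7.0008
```

The effect only appears for scores far from the reference values and a heavy penalty, in line with the "extreme circumstances" caveat above.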

Lastly, Properties 1–5 cannot uniquely characterize the proposed score aggregation procedure. The difficulty arises from the fact that the weights in Expression (1) are not constant and depend in a nonlinear way on the scores that they weigh (Footnote 9).

5 A Data Application to the Olympic Games

In this section, we apply the proposed score aggregation procedure to the diving competition of the 2000 Summer Olympic Games. The objective is to illustrate the application of the score aggregation procedure to real data, and to discuss some implementation issues and the obtained results.

The data set is obtained from Emerson et al. (2009), and is composed of 10,788 dives with specific information about the score and the difficulty of each dive, the identity of each diver and the identity of each judge, for the preliminary round, the semi-final and the final stages of the event. The level of detail in the available information makes this data set particularly valuable for studying bias in sports performance evaluation.

We start by describing the aggregation procedure used by the International Olympic Committee to compute the final score. The judges awarded scores ranging from 0 to 10 in increments of 0.5. The judging panel was composed of seven judges making independent assessments of the quality of each dive. For each dive, the final score was calculated by removing the lowest and the highest scores and averaging the middle five scores. This truncated average was then multiplied by the degree of difficulty \(DD_{i}\) and by 3, in accordance with the following formula:

$$\begin{aligned} Olympic\text { }score\text { }(dive\text { }i)=DD_{i}\times 3\times {\overline{s}} _{i}^{\prime }, \end{aligned}$$
(3)

for all \(i\in I,\) where \({\overline{s}}_{i}^{\prime }\) denotes the truncated average resulting from the middle five scores, i.e., \({\overline{s}} _{i}^{\prime }=(\sum \nolimits _{j=1}^{7}s_{ij}-\min _{j}\left\{ s_{ij}\right\} -\max _{j}\left\{ s_{ij}\right\} )/5.\)
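As a sketch, the truncated average \({\overline{s}}_{i}^{\prime }\) and the Olympic score in (3) are simple to implement. The panel scores below are hypothetical (the data set's actual scores are not reproduced here); the removal of exactly one lowest and one highest score follows the formula for \({\overline{s}}_{i}^{\prime }\) above.

```python
def olympic_score(scores, dd):
    """Formula (3): drop one lowest and one highest of the seven panel
    scores, average the remaining five, and multiply by the degree of
    difficulty and by 3."""
    if len(scores) != 7:
        raise ValueError("a full panel awards seven scores")
    trimmed_mean = (sum(scores) - min(scores) - max(scores)) / 5
    return dd * 3 * trimmed_mean

# Hypothetical panel: the 7.5 and the 9.0 are dropped, leaving a trimmed
# mean of 8.2, so a dive of difficulty 3.0 scores 8.2 x 3 x 3.0 = 73.8.
print(olympic_score([8.5, 8.0, 8.0, 7.5, 8.0, 9.0, 8.5], dd=3.0))
```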

In order to compare our results with those of the International Olympic Committee, we also remove the lowest and highest scores. Note that, in our context, if the lowest or highest score is tied, the removed score is the one associated with the judge with the largest deviation from his/her grading style. Therefore, we may already be removing some potentially biased scores. The final score is then obtained by multiplying the score aggregation procedure by the degree of difficulty \(DD_{i}\) and by 3, as in the International Olympic Committee procedure, according to the following formula:

$$\begin{aligned} Scores\text { }aggregation\text { }procedure\text { }(dive\text { } i)=DD_{i}\times 3\times {\overline{s}}_{i}^{*}, \end{aligned}$$
(4)

for all \(i\in I,\) where \({\overline{s}}_{i}^{*}\) denotes the score aggregation procedure in Expression (2) with the weights given by Expression (1).

Lastly, in both procedures, the scores obtained in each dive are added up to obtain each diver’s final score.

In what follows, we analyze the men's 3-m springboard and the women's 10-m platform diving competitions. In these two events, the difference between the first two divers is so narrow that the final medal standings could easily have been influenced by the presence of bias.

5.1 The Men’s 3-meter Springboard Diving Competition

In the 2000 Summer Olympics, the diver Xiong Ni of China won the gold medal by an extremely narrow margin over Fernando Platas of Mexico (Column (1) of Table 1). The result generated controversy because, of the eleven dives counting towards the final score (i.e., six dives from the final stage and five dives from the semi-final stage), three were evaluated by a panel that included a judge of the same nationality as the winning diver. The Chinese judge Facheng Wang participated in the semi-final stage, and three of his judgments counted towards the final score. Note that judges with the same nationality as the competitors are not normally assigned to the final stage, although they can be assigned to earlier stages of the competition, as in this case to the semi-final stage.

Some years later, Emerson et al. (2009) studied the diving competition of the 2000 Summer Olympic Games. However, their results were not sufficiently significant to support the argument that the judge Facheng Wang benefited the diver Xiong Ni in the men’s 3-m springboard diving competition.

In what follows, we apply the score aggregation procedure proposed in this paper to the 2000 Summer Olympics men's 3-m springboard diving competition data and discuss the results obtained.

Table 1 The men’s 3-meter springboard diving competition final scores: comparison of the Olympic Committee procedure (Olympics) and the scores aggregation procedure (SAP) for \(\gamma =2.\)

The final Olympic score calculated using (3) is shown in Column (1) of Table 1. The application of the proposed score aggregation procedure with \(\alpha =1/3\) and \(\gamma =2\) to the grades awarded in the eleven dives returns the first place to Xiong Ni with 709.74 points against Fernando Platas with 709.33 points (Column (2) in Table 1). Similarly, with \(\alpha =2/3\) and \(\gamma =2,\) it returns the first place to Xiong Ni with 710.07 points against Fernando Platas with 709.03 points (Column (3) in Table 1) (Footnote 10). Therefore, the proposed score aggregation procedure corroborates Emerson et al. (2009) and the final medal standings (Footnote 11). However, the final medal standings would have changed for \(\alpha <0.12\) with \(\gamma =2.\)

In what follows, we discuss the results obtained and their intuition in more detail. In this context, consider the information in Tables 2 and 3 regarding the scores awarded to the divers Xiong Ni and Fernando Platas, respectively, by a panel of judges in which the judge Facheng Wang participated. The column labeled "semi #" shows the grade awarded by the corresponding judge, and the subsequent column labeled "style #" shows the expected grade associated with that judge's grading style, measured following the method in Sect. 3. Since we have no information about the judges' grading histories before the Olympics, the judges' grading styles are calculated by averaging the grades awarded by each judge during the full Olympic event.

The row labeled "AVERAGE" shows the panel mean and the mean of the grading styles, respectively. The row labeled "% DEV. from AVERAGE" shows the percentage by which the judge Facheng Wang graded the divers differently from the panel mean, and the row labeled "% DEV. from STYLE" shows the percentage by which he graded them differently from his own grading style.
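Read literally, the two deviation rows are plain percentage deviations from the corresponding reference value. The sketch below encodes this reading; the function name and the numbers are hypothetical, and the formula is an inference from the table labels rather than something stated explicitly in the text.

```python
def pct_dev(score, reference):
    """Percentage deviation of a score from a reference value
    (the panel mean for '% DEV. from AVERAGE', the judge's expected
    style grade for '% DEV. from STYLE')."""
    return 100.0 * (score - reference) / reference

# Hypothetical example: an 8.5 against a panel mean of 8.0 is a +6.25%
# deviation, while the same 8.5 against a style grade of 8.7 is about -2.3%.
print(round(pct_dev(8.5, 8.0), 2))   # 6.25
print(round(pct_dev(8.5, 8.7), 2))   # -2.3
```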

Table 2 The men’s 3-meter springboard diving competition: the grades awarded to Xiong Ni (CHN) and the expected grades compatible with each judge style in the semi-finals first three dives.

It is clear from the information in Table 2 that in all three dives performed by Xiong Ni (CHN), the judge Facheng Wang deviated more from the panel mean (i.e., deviations of \(8.2\%\), \(3.7\%\) and \(0.8\%\), respectively) than from his own grading style (i.e., deviations of \(5.5\%\), \( 1.2\%\) and \(-0.2\%\), respectively). Similarly, it is clear from the information in Table 3 that in all three dives performed by Fernando Platas (MEX), the judge Facheng Wang deviated less from the panel mean (i.e., deviations of \(-7.1\%\), \(-4.8\%\) and \(-1.0\%\), respectively) than from his own grading style (i.e., deviations of \(-9.8\%\), \(-7.4\%\) and \( -3.5\%\), respectively). This information seems to support the idea that the judge Facheng Wang may have simultaneously benefited the diver Xiong Ni and penalized the diver Fernando Platas.

Both types of deviations are captured by the score aggregation procedure proposed in this paper. However, when the aggregation procedure is applied (see Table 1), the grades of the judge Facheng Wang appear to be neither significantly biased nor decisive. The reason might be that the scores awarded in the first dive of Xiong Ni (see Table 2), and in the first and second dives of Fernando Platas (see Table 3), were removed from the calculation of the final score. These are apparently the most biased grades. The deviations in the other three dives, which are considered in the calculation of the final score, are much milder and for that reason not strong enough to induce a significant change in the final medal standings. This fact may explain why the scores awarded by the judge Facheng Wang seem to have no clear and significant influence on the final medal standings according to the score aggregation procedure.

Table 3 The men’s 3-meter springboard diving competition: the grades awarded to Fernando Platas (MEX) and the expected grades compatible with each judge style in the semi-finals first three dives.

After removing these three extreme scores, in overall terms, the judge Facheng Wang seems to deviate more from the panel mean than from his own grading style. Consequently, we can still observe a reversal of the final medal standings if we place more importance on score deviations from the panel mean (i.e., for \(\alpha <0.12,\) not shown in Table 1), but not otherwise (i.e., for \(\alpha \ge 0.12,\) as shown in Columns (2) and (3) of Table 1). However, the margin between Fernando Platas and Xiong Ni, even in the most extreme case of \(\alpha =0,\) would be very small (i.e., 0.22 points), and for that reason not strong enough to unequivocally support the argument that the judge Facheng Wang favored Xiong Ni.

These conclusions could have changed drastically, and the score aggregation procedure would have shown more significant corrections, had we not removed the most extreme grades, as is done by the International Olympic Committee.

Note that the diver ranked in fourth place in Column (1) of Table 1, Xiao Hailiang, is also from China. Table 4 shows the grades awarded and the associated expected grading styles of the seven judges in the three semi-final dives in which the judge Facheng Wang participated. The labels and interpretation of the information in Table 4 are similar to those in Table 2. The same is true of the interpretation of the results, where, again, the data seem to suggest that the judge Facheng Wang awarded higher scores to his compatriot Xiao Hailiang. The same scoring pattern observed in Table 2 for the diver Xiong Ni is also present in Table 4 for the diver Xiao Hailiang. In other words, in the three dives, the judge Facheng Wang simultaneously deviated from the panel mean (i.e., \(6.8\%\), \(3.8\%\) and \(4.4\%\), respectively) and from his own grading style (i.e., \(6.0\%\), \(2.4\%\) and \(2.4\%\), respectively). In the three dives, the judge Facheng Wang was always among the judges awarding the highest score to the diver Xiao Hailiang.

Table 4 The men’s 3-meter springboard diving competition: the grades awarded to Xiao Hailiang (CHN) and the expected grades compatible with each judge style in the semi-finals first three dives.

However, this case was not so controversial because the gap between the diver Xiao Hailiang and the diver Dmitri Sautin (ranked in third place) was very large.

In line with our comments, the Olympic Committee procedure and the score aggregation procedure deliver similar numbers in terms of magnitude (see Table 1), which is not necessarily undesirable, since in most cases bias is only a possibility. Therefore, the score aggregation procedure should correct potential bias, but without distorting the results. In this context, the proposed score aggregation procedure is a refinement of the procedure employed by the Olympic Committee, but it does not dispense with the use of transparency policies, such as the public disclosure of each judge's grades, which are simple and particularly effective anti-bias monitoring schemes.

5.2 The Women’s 10-meter Platform Diving Competition

Similarly, Emerson et al. (2009) also studied the women's 10-m platform diving competition. They found that judging bias (not necessarily nationalistic bias) could have changed the final medal standings. The diver Laura Wilkinson of the USA finished ahead of Li Na of China by 1.74 points (i.e., 543.75 and 542.01 points, respectively), but after removing the effect of bias, they found that the diver Li Na would have won the event by a margin of 0.36 points. Most of the controversy is driven by the fact that Li Na was well ahead of Laura Wilkinson after the four semi-final dives, but lost the event in the five final dives. Since both divers finished very close to each other, any potential bias could have made the difference.

However, the identity and the type of bias reported in Emerson et al. (2009) are not clearly specified. Nonetheless, since the score aggregation procedure is constructed to correct for bias, we have applied it to the scores awarded in the nine dives counting towards the final score of the women's 10-m platform diving competition. We found that the application of the proposed score aggregation procedure for \(\alpha =1/3\) and \(\alpha =2/3\) (with \(\gamma =2\) held constant) confirms the first place for Laura Wilkinson with 544.68 and 544.61 points, respectively, against Li Na with 541.60 and 541.65 points, respectively. Laura Wilkinson's advantage is reinforced by the score aggregation procedure.

Note that our results do not contradict the results found by Emerson et al. (2009) in support of the existence of bias in favor of Laura Wilkinson. In particular, as pointed out by Emerson et al. (2009), there might exist multiple sources of bias of unknown magnitude affecting both divers in different ways. The difference between our results and theirs is explained by the fact that the score aggregation procedure in this paper addresses general forms of bias that are based on deviations from the panel mean and the judges' grading styles. It has not been specifically designed to address nationalistic bias, but it corrects nationalistic bias that materializes either in the form of deviations from the panel mean or from the judges' grading styles. In this context, our results may suggest that the scores of both divers might have been affected by different forms of bias. Separating and distinguishing between these different cognitive biases is difficult.

6 Conclusion

The existence of bias distorts the quality, reliability, validity and objectivity of the evaluation process, and leads to ineffective decision-making. This issue is relevant in numerous scientific, academic and professional fields.

This paper proposes a practical score aggregation procedure that attempts to reduce and mitigate the influence of bias in subjective judgments. The starting point is to acknowledge that it is virtually impossible to design a procedure that can prevent all forms of bias (Gibbard 1973; Satterthwaite 1975). The reason is that conscious bias can be hidden in very complex and strategic ways, and judges are rational agents who can learn how the procedure functions and adjust strategically in order to make bias detection difficult. Consequently, bias is unlikely to disappear, but its influence can be seriously restricted if we adopt adequate bias correction mechanisms.

In this context, the proposed score aggregation procedure offers a tool that can help correct and mitigate the effects of bias that manifest as deviations from the panel mean and the judges' grading styles. The argument is that biased behavior is associated with deviations from the mean judgment and/or deviations from the individual judgment style. However, the proposed aggregation procedure has some limitations when it comes to dealing with bias that affects all or the majority of the judges, or bias towards the mean instead of away from it, and it is not designed to address a particular and specific form of bias (e.g., nationalistic bias), which must be treated individually. For that reason, the proposed procedure does not dispense with the complementary and simultaneous use of transparency policies, such as the public disclosure of each judge's score, which are simple and particularly powerful anti-bias mechanisms (Zitzewitz 2014) (Footnote 12). However, in reality, and in order to avoid speculation, detailed data about the scores awarded by each judge are usually not publicly available, which creates difficulties when it comes to identifying potentially biased behavior (Footnote 13).

In this paper, we focus mostly on sports, but the number of situations that require individual judgments and evaluations, and that can be the object of different sources of bias is endless. The approach in this paper can be extended to these other dimensions of our lives. Nowadays, the internet is making evaluation procedures based on subjective judgments extremely common. Many websites and mobile phone apps ask their users to rate anonymously (or not) all kinds of items, goods and services—from tourist places and blog comments to wines, books, films or music. It is this increasing interest in the content of subjective judgments and their associated controversies that motivates the present paper and the need to study bias in subjective judgments in more detail (Frey and Gallus 2017; Frey 2017).

Despite the difficulties associated with the fact that data are not publicly available, and the challenges associated with the design of mechanisms that can prevent or mitigate the influence of all forms of bias, there is plenty of research to be done in this area. A large body of empirical and experimental literature identifies the existence of multiple forms and sources of bias (Bar-Eli et al. 2011; Dohmen and Sauermann 2016; Plessner and Haar 2006). However, in most cases, there are no practical or operational solutions that can be applied in real-life situations to remove or minimize the negative effects of bias on people's lives. This paper is a step forward in this direction and the continuation of an extensive research agenda on bias correction mechanisms in subjective evaluations and judgments.

In this context, we hope this paper will help researchers, practitioners and professionals to better understand how bias operates in subjective judgments, and consequently to provide guidance in the design and implementation of optimal aggregation procedures that can reduce and mitigate the effects of bias in our lives.