1 Introduction

The kappa statistic was proposed by Cohen (1960) to measure the agreement between two raters (also called “judges” or “observers”) who independently judge n subjects on a scale consisting of q categories. Kappa has become a well-known index for the comparison of expert judgements, especially in the psychometric field (Uttal et al. 2013; Harvey and Tang 2012; Markon et al. 2011; Östlin et al. 1990).

A comprehensive review of inter-rater agreement coefficients is provided by Gwet (2008) and by Dijkstra and Eijnatten (2009).

The use of Cohen’s kappa statistic has been increasing despite two important paradoxes (Cicchetti and Feinstein 1990; Feinstein and Cicchetti 1990): (i) high levels of raters’ agreement can coexist with low kappa values (a consequence of the prevalence of the trait in the sample) and (ii) changes in the statistic under different marginals are unpredictable (due to the symmetry of ratings in the disagreement categories). This paradoxical behaviour has been widely studied (Cicchetti and Feinstein 1990; Feinstein and Cicchetti 1990; Lantz and Nebenzahl 1996; Shoukri 2004).

By contrast, very little attention has been devoted so far to a similar problem affecting the statistic proposed by Fleiss (1971) as a multiple-rater generalization of Cohen’s kappa. In fact, in specific situations, Fleiss’ kappa takes very low values even in the presence of high agreement.

This paradox is due to the fact that this measure of agreement for nominal scales is not invariant under permutation of categories. In order to solve this problem, we propose a permutation-invariant version of Fleiss’ kappa that is not affected by the paradox.

Since the problem depends on particular combinations of category assignments and the scale is nominal, we can apply permutation techniques without any loss of information. In particular, we permute the dataset, calculate Fleiss’ kappa on each “permuted” dataset and synthesize the results with a robust statistic.

In Sect. 2 we describe Fleiss’ statistic, in Sect. 3 we discuss its paradoxical behaviour, and in the subsequent Sections we present a method that solves the problem of the paradoxes through the combined use of permutation techniques and resampling methods.

2 Fleiss’ kappa statistic

We consider an inter-rater reliability study with \(n\) subjects and \(r\) raters per subject. All raters have to assign each subject to one of \(q\) exhaustive and mutually exclusive categories.

These studies involve raters who are experts in a given area (e.g. physicians, in particular psychologists and psychiatrists, as well as archaeologists, art critics, judges, etc.). It is possible to quantify the agreement among the observers who have participated in such a survey.

Table 1 shows the frequency distribution of \(r\) raters by \(n\) subjects and \(q\) response categories: \(r_{ij}\) represents the number of raters assigning the \(i\)th subject \(\left( i = 1, ..., n\right) \) to the \(j\)th category \(\left( j = 1, ..., q\right) \).

Table 1 Distribution of raters by subject and response category

In Table 1, the marginal distribution \(r_{i \cdot }=\sum ^{q}_{j=1}{r_{i j}}=r\) provides the total number of raters and the marginal \(r_{\cdot j}=\sum ^{n}_{i=1}{r_{i j}}\) provides the total number of assignments to category j.

When two or more raters agree in assigning subject \(i\) to category \(j\), the agreement among raters is shown by the corresponding frequency in Table 1: \({r_{i j}}\ge 2\).

Using the binomial coefficient, we determine the number of concordant pairs:

$$\begin{aligned} \binom{r_{ij}}{2}= \frac{r_{ij}\left( r_{ij}-1\right) }{2}. \end{aligned}$$

We define the proportion of pairs of concordant raters assigning subject \(i\) to category \(j\) as:

$$\begin{aligned} P_{ij}=\frac{\binom{r_{ij}}{2}}{\binom{r}{2}}=\frac{r_{ij}\left( r_{ij}-1\right) }{r\left( r-1\right) }. \end{aligned}$$

Hence, we can calculate the proportion of concordant pairs for the \(i\)th subject over all the \(r(r-1)\) possible ordered pairs of assignments:

$$\begin{aligned} P_i=\sum ^{q}_{j=1}{P_{ij}}=\sum ^{q}_{j=1}{\frac{\binom{r_{ij}}{2}}{\binom{r}{2}}}=\frac{1}{r-1}\left( \frac{1}{r} \sum _{j=1}^{q}{r_{ij}^{2}}-1 \right) . \end{aligned}$$
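For example, if \(r = 6\) raters split 5–1 between two categories for subject \(i\), so that \(\sum _{j}{r_{ij}^{2}}=5^2+1^2=26\), then

$$\begin{aligned} P_i=\frac{1}{6-1}\left( \frac{26}{6}-1 \right) =\frac{2}{3}, \end{aligned}$$

i.e. two thirds of the pairs of raters agree on subject \(i\).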

Following Fleiss (1971) and Fleiss et al. (2003), the overall agreement can be measured as:

$$\begin{aligned} \bar{P}=\frac{1}{n} \sum _{i=1}^{n}{P_i}=\frac{1}{r-1}\left( \frac{1}{nr} \sum _{i,j}{r_{ij}^{2}}-1 \right) \!. \end{aligned}$$
(1)

In general, a subject is considered deterministically assigned to a category when, repeating the judgement several times, none of the raters changes their categorization. On the other hand, the categorization of a subject is defined as random when it does not result from a shared evaluation but is due only to chance.

The overall agreement of two or more raters therefore has to be interpreted as the observable effect of the combination of two non-observable factors: a deterministic factor and a random factor. In order to isolate the deterministic component (the object of our study), we first have to define the chance-agreement probability.

According to Scott (1955) and Fleiss (1971), the probability that a rater assigns a subject to category \(j\) is estimated by the overall proportion of assignments to that category:

$$\begin{aligned} p_j=\frac{r_{\cdot j}}{nr}=\frac{1}{nr} \sum _{i=1}^{n}{r_{ij}} \end{aligned}$$

and the agreement expected by chance is given by:

$$\begin{aligned} \bar{P}_e=\sum _{j=1}^q{p_j^2}\in \left[ \frac{1}{q},1\right] . \end{aligned}$$
(2)

If we correct the overall agreement probability (1) for the agreement probability due to chance (2) and normalize, we obtain the statistic:

$$\begin{aligned} K_{Fleiss}=\frac{\bar{P}-\bar{P}_e}{1-\bar{P}_e}\in \left[ -\frac{1}{r-1},1\right] , \end{aligned}$$
(3)

proposed by Fleiss (1971) as a generalization of Cohen’s kappa (1960).

In this respect, it should be noted that Fleiss’ kappa is actually the multiple-rater extension of Scott’s \(\pi \) index (Scott 1955; Gwet 2008) rather than of Cohen’s kappa. Fleiss’ kappa is one of the most common indices for quantifying agreement among multiple raters (Fleiss et al. 2003), but in practice it can return inconsistent results.
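To make the computation concrete, the following minimal sketch (in Python with NumPy; the function name and the assumption of a complete \(n \times q\) counts matrix with constant row sum \(r\) are ours) implements formulas (1)–(3):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an n x q matrix of counts r_ij (each row sums to r)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    r = counts[0].sum()                                        # raters per subject
    p_j = counts.sum(axis=0) / (n * r)                         # proportion of assignments to category j
    P_bar = ((counts ** 2).sum() / (n * r) - 1.0) / (r - 1.0)  # overall agreement, formula (1)
    P_e = (p_j ** 2).sum()                                     # chance agreement, formula (2)
    return (P_bar - P_e) / (1.0 - P_e)                         # Fleiss' kappa, formula (3)
```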

3 Paradoxical behaviour of Fleiss’ kappa

In Table 2 we describe a particular case of poor performance of Fleiss’ kappa. All subjects are assigned only to the first two categories, with the same split for every subject: \(M\) raters choose the first category and the remaining \(r-M\) raters choose the second. Let \(M\) be an integer between \(0\) and \(r\). As \(M\) varies, we move from a situation of complete agreement among the examiners (produced by the extreme values \(M = 0\) and \(M = r\)) to situations of weaker agreement (for the intermediate values of \(M\)).

Table 2 Distribution of raters by subject and response category leading to the paradoxical behaviour of Fleiss’ kappa

Expression (3), applied to Table 2, returns:

$$\begin{aligned} K_{Fleiss}=-\frac{1}{r-1},\quad \hbox {for}\, 0<M<r \end{aligned}$$

and

$$\begin{aligned} K_{Fleiss} \rightarrow -\frac{1}{r-1},\quad \hbox {for} \,M \rightarrow 0\,\, \hbox {or} \,\,M \rightarrow r. \end{aligned}$$

These results show the inadequacy of Fleiss’ kappa in capturing high levels of agreement, because the index takes a constant, negative value even when a very high inter-rater agreement would be expected.

Fleiss’ kappa does not make it possible to recognize different degrees of agreement as \(M\) varies. Moreover, it does not allow situations of perfect agreement to be distinguished from other configurations. For instance, setting \(M=5\) and \(r=6\) in Table 2, we obtain:

$$\begin{aligned} K_{Fleiss}=-0.2, \end{aligned}$$

even though five out of six raters fully agree in their judgement.
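This value can be reproduced with the hypothetical `fleiss_kappa` sketch above; the numbers of subjects and of categories used below are arbitrary choices, since the result depends on neither:

```python
# Every subject: 5 raters choose category 1, 1 rater chooses category 2 (M = 5, r = 6).
table2 = np.tile([5, 1, 0, 0, 0], (10, 1))   # 10 subjects, 5 categories
print(fleiss_kappa(table2))                  # approximately -0.2
```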

Another example of the inconsistent behaviour of Fleiss’ kappa can be shown using Table 4 (Fleiss 1971), which reports the classification of 30 patients into five diagnostic categories by six psychiatrists. Calculating Fleiss’ kappa on these data, we obtain:

$$\begin{aligned} K_{Fleiss}=0.430. \end{aligned}$$

Merging the last three categories into a single new category, we would expect increased agreement; instead, the value of kappa decreases:

$$\begin{aligned} K_{Fleiss}=0.205. \end{aligned}$$
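In terms of the counts matrix of Table 1, merging categories simply amounts to summing the corresponding columns. A minimal sketch (the helper name and the argument `k`, the number of trailing categories to collapse, are ours):

```python
def merge_last_categories(counts, k=3):
    """Collapse the last k columns of a counts matrix into a single category."""
    counts = np.asarray(counts)
    return np.column_stack([counts[:, :-k], counts[:, -k:].sum(axis=1)])
```

Applying `fleiss_kappa` to the original Table 4 counts and to the merged table should reproduce the two values reported above.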

4 Calculating Fleiss’ kappa without paradoxical behaviour

In Sect. 3 we highlighted particular configurations of category assignments and analysed cases in which the kappa statistic underestimates the agreement. The basic idea of our work is to use permutation techniques (Mielke and Berry 2007) to solve the problem of this paradoxical behaviour. Since we only consider categorical data, permutations do not lead to any loss of information on agreement.

Referring to Table 1, we propose, for each row \(i\), to randomly choose a permutation of the \(q\) frequencies \(r_{i1}, \dots , r_{iq}\) and to replace the original row with this permuted vector. Fleiss’ kappa is then calculated on the permuted frequency table.

Repeating this procedure \(C\) times and synthesizing the \(C\) values of kappa by means of a robust index, we obtain a better quantification of the inter-rater agreement. In particular, we propose using the median to synthesize the repeated permutation results.

As far as the choice of \(C\) is concerned, the number of all possible permuted tables that can be generated from Table 1 is equal to \(\left( q!\right) ^n\). This number is usually too large for an exhaustive examination; hence, we approximate the exhaustive result with a smaller number \(C\) of randomly chosen permutations.
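A minimal sketch of this procedure, reusing the hypothetical `fleiss_kappa` function above (the name `robust_kappa` and the default value of \(C\) are ours):

```python
def robust_kappa(counts, C=100, rng=None):
    """Median of Fleiss' kappa over C tables whose rows are independently permuted."""
    rng = np.random.default_rng(rng)
    counts = np.asarray(counts)
    kappas = [
        fleiss_kappa(np.array([rng.permutation(row) for row in counts]))
        for _ in range(C)
    ]
    return np.median(kappas)
```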

Applying this technique to the diagnostic data in Table 4, we calculate the value of the robust kappa, \(K_r\), and show how the proposed permutation method resolves the paradoxes of Fleiss’ kappa.

From the original dataset, we calculate the value:

$$\begin{aligned} K_r=0.436 \end{aligned}$$

and from the merged dataset:

$$\begin{aligned} K_r=0.454. \end{aligned}$$

This result shows that \(K_r\), unlike Fleiss’ kappa, detects the increase in inter-rater agreement. The suggested procedure, however, involves a considerable computational effort: as the size of the table and the number of iterations increase, the required computational time grows dramatically.

The proposed method for calculating Fleiss’ kappa also yields a new way of constructing confidence intervals that is not affected by the paradoxes. In the following Section, we show how to build an interval estimator based on Bootstrap techniques, consistent with the procedure proposed for the robust kappa.

5 Bootstrap confidence intervals for robust kappa

In this Section we propose the joint use of permutation techniques and resampling methods to construct confidence intervals. The proposed Bootstrap intervals, unlike the standard interval (Fleiss et al. 2003), avoid the paradoxes and perform well even when the number of subjects \(\left( n\right) \) is small.

Let us denote by \(p_{ij}=\frac{r_{ij}}{r}\) (where \(i=1, \dots , n\) and \(j=1, \dots , q\)) the proportion of raters assigning subject \(i\) to category \(j\). The resampled \(i\)th row follows a multinomial distribution with parameters \(r\) and \(p_{i1}, \dots , p_{iq}\) (with \(i=1, ..., n\)). We can apply the algorithm described in Sect. 4 to each resampled table. Repeating this procedure B times, we obtain B values of the robust Fleiss’ kappa. From the resulting distribution, we can calculate the quantiles of order \(\alpha \) and \(1-\alpha \), which respectively represent the lower and upper bounds of the Bootstrap percentile interval at confidence level \(1-2\alpha \) (Shao and Tu 1995).
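A sketch of the resulting interval estimator, reusing the hypothetical `robust_kappa` function above (function and argument names are ours):

```python
def bootstrap_percentile_ci(counts, B=1000, C=100, alpha=0.025, rng=None):
    """Bootstrap percentile interval for the robust kappa at confidence level 1 - 2*alpha."""
    rng = np.random.default_rng(rng)
    counts = np.asarray(counts)
    r = int(counts[0].sum())
    stats = []
    for _ in range(B):
        # Resample each row from a multinomial with parameters r and p_i1, ..., p_iq.
        resampled = np.array([rng.multinomial(r, row / r) for row in counts])
        stats.append(robust_kappa(resampled, C=C, rng=rng))
    return np.quantile(stats, [alpha, 1.0 - alpha])
```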

Because of the computational effort of the permutations, we prefer the Bootstrap percentile method to the Bootstrap accelerated bias-corrected percentile (BCa) and Bootstrap-t methods, which are more accurate (i.e. second-order accurate; Shao and Tu 1995) but also more computationally expensive. In order to assess the results obtained with the Bootstrap percentile confidence intervals (which are first-order accurate; Shao and Tu 1995) at the \(95~\%\) level, we compare them with the intervals obtained by BCa, Bootstrap-t and the Fleiss–Levin–Paik asymptotic method (Fleiss et al. 2003).

According to Fleiss et al. (2003), for n large enough, Fleiss’ kappa is approximately Normally distributed, with estimated standard error:

$$\begin{aligned} s_k=\frac{\sqrt{2}}{ \sum ^{q}_{j=1}{p_{ j}\left( 1-p_{ j}\right) }\sqrt{nr\left( r-1\right) }}\sqrt{ \left[ \sum ^{q}_{j=1}{p_{j}\left( 1-p_{ j}\right) }\right] ^2-\sum ^{q}_{j=1}{p_{ j}\left( 1-p_{ j}\right) \left( 1-2p_j\right) }} \end{aligned}$$
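For comparison, a sketch of this asymptotic standard error (the function name is ours; the corresponding Fleiss–Levin–Paik interval at the \(95~\%\) level is \(K_{Fleiss} \pm 1.96\, s_k\)):

```python
def fleiss_kappa_se(counts):
    """Asymptotic standard error of Fleiss' kappa (Fleiss, Levin and Paik 2003)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    r = counts[0].sum()
    p = counts.sum(axis=0) / (n * r)                 # category proportions p_j
    s = (p * (1.0 - p)).sum()
    num = np.sqrt(s ** 2 - (p * (1.0 - p) * (1.0 - 2.0 * p)).sum())
    return np.sqrt(2.0) * num / (s * np.sqrt(n * r * (r - 1.0)))
```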

From the original dataset (Table 4) and the merged dataset we obtain the confidence intervals reported in Table 3 (where \(C=100\) and \(B=1000\)).

Table 3 Confidence intervals (at level of \(95~\%\)) for Fleiss’ kappa
Table 4 Frequency of assignment of patients to diagnostic categories (source: Fleiss 1971)

The Bootstrap intervals constructed with the proposed method do not suffer from the paradoxes of Fleiss’ kappa and do not require a large sample size, although they involve a certain computational effort.

In particular, we observe that a small number of subjects is the typical situation in the psychometric field, and it prevents the use of asymptotic approximations.

6 Concluding remarks

In this work we investigated the underestimation of agreement by Fleiss’ kappa statistic when assessing high levels of inter-rater agreement. Since the order of the categories is not relevant for nominal variables, we proposed a solution based on permutation techniques.

In order to avoid the paradoxes of this index (exposed in Sect. 3), we suggest permuting each row of the original dataset. The new permuted matrix has a level of agreement quite similar to that of the original one, in spite of a different configuration. Repeating this operation C times, the “new” datasets and the corresponding values of Fleiss’ kappa can be used to assess the level of agreement of the original dataset. To summarize the C values of Fleiss’ kappa, we proposed a robust statistic (the median), which is less affected by the extreme values that cause the unexpected behaviour of Fleiss’ kappa.

The problems of this statistic also affect the corresponding confidence interval proposed by Fleiss et al. (2003): this interval suffers from the same paradoxes as Fleiss’ kappa and is based on an asymptotic Normal approximation (so it is valid only for n large enough). Therefore, we proposed a Bootstrap interval that is not affected by the paradoxes and is applicable even when n is too small for the Normal approximation.