Reliability and agreement studies are of paramount importance in the behavioural, social, and medical sciences. They contribute to the quality of studies by providing information about the amount of error inherent in any diagnosis, score, or measurement, such as a depression diagnosis in mental health care or a student progress assessment in educational research. Kottner et al. (2011) provided guidelines for reporting reliability and agreement studies. They advise the use of the kappa-like family of coefficients for categorical and ordinal scales.

Cohen (1960) first introduced the classical kappa coefficient to measure agreement on nominal scales. Based on the classical reliability model for binary scales, Kraemer (1979) showed that the kappa coefficient is a reliability coefficient. This coefficient was then extended to account for situations where disagreements between raters are not all of equal importance. For example, on an ordinal scale, a greater "penalty" can be applied if the two categories chosen by the raters are farther apart. To account for these inequalities, Cohen (1968) introduced weights in the formulation of the agreement coefficient, leading to the weighted kappa coefficient. Although the weights can be arbitrarily chosen, those introduced by Cicchetti and Allison (1971) and by Fleiss and Cohen (1973) are the most commonly used. The former depend linearly on the distance between the classifications made by the two raters, while the latter depend quadratically on that distance. Quadratic weights are the most popular because of their practical interpretation. Cohen (1968) and Schuster (2004) showed that the quadratic-weighted kappa coefficient is asymptotically equivalent to the intraclass correlation coefficient under a two-way ANOVA model. In other words, the quadratic-weighted kappa coefficient compares the variability between the pairs of items to the total variability. More recently, Warrens (2014) studied the relationship between the quadratic-weighted kappa coefficient and corrected Zegers-ten Berge coefficients under the four definitions of agreement introduced by Stine (1989). A first interpretation of the linear-weighted kappa coefficient, not very convenient in practice, was given by Vanbelle and Albert (2009) and Warrens (2011), some 30 years after its introduction. Similarly to Cohen’s kappa coefficient, the linear-weighted kappa coefficient is a weighted average of the individual kappa coefficients obtained on the \(2 \times 2\) tables constructed by collapsing the first \(k\) categories and the last \(K-k\) categories (\(k=1,\ldots ,K-1\)) of the original \(K\times K\) classification table.

All kappa-like coefficients have the particularity of accounting for chance agreement, i.e., for the amount of agreement expected between the two raters if their classifications were made randomly. Although such a correction is often a desirable property, it introduces a dependence of the coefficients on the marginal distributions of the raters. Hence, kappa-like coefficients mix two sources of disagreement: (1) bias between the two raters and (2) disagreement on the classification of the items themselves. Criticisms of Cohen’s kappa coefficient are mainly based on this property (e.g., Feinstein and Cicchetti (1990), Cicchetti and Feinstein (1990), Byrt, Bishop, and Carlin (1993)). Further criticisms were formulated against weighted kappa coefficients because the weights are arbitrary and the use of different weighting schemes can lead to different conclusions (e.g., Vanbelle (2013)). Unappealing mathematical properties of the quadratic-weighted kappa coefficient were pointed out by Brenner and Kliebsch (1996), Yang and Chinchilli (2011), and Warrens (2013c). Warrens (2013a, b, c, d) therefore tends to favor the linear-weighted kappa coefficient. Warrens (2012, 2013d) also studied the ordering between the linear- and the quadratic-weighted kappa coefficients under particular conditions. Unfortunately, to the best of our knowledge, no general relationship between the unweighted, linear-, and quadratic-weighted coefficients has been established and no clear guideline exists for the choice of a weighting scheme.

This paper focuses on weighted kappa coefficients whose weights are functions of the number of categories separating the classifications made by the two raters, as in Warrens (2013a). After the classical definition of the weighted kappa coefficients is recalled in Sect. 1, a new, simple, and practical interpretation of the linear- and quadratic-weighted kappa coefficients will be given in Sect. 2 and illustrated in Sect. 3. Then, in the light of this new interpretation, the equation governing the relationship between Cohen’s, the linear-, and the quadratic-weighted kappa coefficients will be provided for a general \(K\)-ordinal scale in Sect. 4. Practical recommendations on the choice of a kappa coefficient will be formulated in Sect. 5. Finally, the new interpretation and the recommendations will be discussed in Sect. 6.

1 Definition of the Kappa-Like Family

Consider two raters who classify items (subjects/objects) from a population \(\mathcal {I}\) on a \(K\)-ordinal scale. Let \(Y_{ir}\) be the random variable such that \(Y_{ir}=k\) if rater \(r\) (\(r=1,2\)) classifies a randomly selected item \(i\) of population \(\mathcal {I}\) in category \(k\) (\(k=1,\ldots ,K\)). Let \(\pi _{i,jk}\) denote the probability for item \(i\) to be classified in category \(j\) by rater 1 and in category \(k\) by rater 2. Furthermore, denote the marginal probability distribution of rater 1 by \((\pi _{i,1.},\ldots ,\pi _{i,K.})'\) and that of rater 2 by \((\pi _{i,.1},\ldots ,\pi _{i,.K})'\). We assume that, across the population of items \(\mathcal {I}\), \(E(\pi _{i,jk})=\pi _{jk}\), \(E(\pi _{i,.k})=\pi _{.k}\) and \(E(\pi _{i,j.})=\pi _{j.}\). The joint probability classification table is presented in Table 1.

Table 1 Joint and marginal probability distribution over the population of items of the classification of a randomly selected item \(i\) on a \(K\)-ordinal scale by 2 raters.

Agreement coefficients of the kappa-like family can be defined in terms of disagreements by

$$\begin{aligned} \kappa ^{(s)}_{v}=1-\frac{\zeta ^{(s)}_{o,v}}{\zeta ^{(s)}_{e,v}}, \end{aligned}$$

where \(\displaystyle \zeta ^{(s)}_{o,v}=\sum _{j=1}^{K}\sum _{k=1}^{K}v^{(s)}_{jk}\pi _{jk}\) is the observed weighted disagreement and \(\displaystyle \zeta ^{(s)}_{e,v}=\sum _{j=1}^{K}\sum _{k=1}^{K}v^{(s)}_{jk}\pi _{j.}\pi _{.k}\) is the weighted disagreement expected by chance. Usually, \(\displaystyle 0\le v^{(s)}_{jk}\le 1\) and \(v^{(s)}_{jj}=0\) \((j,k=1,\ldots ,K; s \in \mathbb {N})\), where \(\mathbb {N}\) is the set of nonnegative integers. Yang and Chinchilli (2009) showed that kappa coefficients vary between \(-1\) and 1. As a consequence, the observed weighted disagreement can never be larger than twice the weighted disagreement expected by chance.

The weights corresponding to Cohen’s kappa coefficient are \(v^{(0)}_{jk}=1\) for \(j \ne k\) and \(v^{(0)}_{jj}=0\) (\(j,k=1,\ldots ,K\)). Cohen’s kappa coefficient therefore compares the observed probability of disagreement to the probability of disagreement expected by chance. If \(\kappa ^{(0)}_{v}=x\), the observed probability of disagreement between the two raters’ classifications is \((1-x)\) times the probability of disagreement expected by chance. Perfect agreement (\(\kappa ^{(0)}_{v}=1\)) is obtained when no disagreement is observed. A value of zero indicates that the observed probability of disagreement equals the probability expected by chance, while negative values indicate that the observed probability of disagreement is larger than expected by chance.

Although weights can be arbitrarily defined, two weighting schemes based on the number of categories separating the classifications made by the two raters are most commonly used. Cicchetti and Allison (1971) proposed linear weights of the form \(v^{(1)}_{jk}=|j-k|/(K-1)\), whereas Fleiss and Cohen (1973) used quadratic weights \(v^{(2)}_{jk}=(j-k)^2/(K-1)^2\). Since weighted kappa coefficients are invariant under any positive multiplicative transformation of the weights (Cohen, 1968), the unscaled form of the weights \(v^{(s)}_{jk}=|j-k|^s\) (\(s \in \mathbb {N}\)) will be used for convenience.
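A minimal Python sketch of this computation is given below; the function name and the \(3\times 3\) table of proportions are hypothetical and serve only to illustrate the formulas (\(s=0\) gives Cohen’s kappa, \(s=1\) the linear-weighted and \(s=2\) the quadratic-weighted coefficient).

```python
import numpy as np

def weighted_kappa(P, s):
    """Weighted kappa of order s from a K x K joint probability table P.

    P[j, k] is pi_{jk}, the probability that rater 1 chooses category j+1
    and rater 2 chooses category k+1 (0-based indexing). Unscaled weights
    |j - k|^s are used, with the convention v_{jj} = 0.
    """
    P = np.asarray(P, dtype=float)
    K = P.shape[0]
    j, k = np.indices((K, K))
    V = np.abs(j - k).astype(float) ** s            # disagreement weights v_{jk}
    if s == 0:
        np.fill_diagonal(V, 0.0)                    # agreement cells do not count
    row, col = P.sum(axis=1), P.sum(axis=0)         # marginal distributions
    zeta_o = float(np.sum(V * P))                   # observed weighted disagreement
    zeta_e = float(np.sum(V * np.outer(row, col)))  # weighted disagreement by chance
    return 1.0 - zeta_o / zeta_e

# Hypothetical 3 x 3 table of proportions, for illustration only.
P = [[0.30, 0.05, 0.05],
     [0.05, 0.25, 0.05],
     [0.05, 0.05, 0.15]]
for s in (0, 1, 2):
    print(s, round(weighted_kappa(P, s), 3))
```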

These power-weighted kappa coefficients will further be expressed according to the number of categories \(m\) separating the classifications made by the two raters. Let \(m=|j-k|\). Then, \(v_m^{(s)}=m^{s}\) \((m=0,\ldots ,K-1;\ j,k=1,\ldots ,K;\ s \in \mathbb {N})\). The observed and expected weighted disagreements of order \(s\) are then given, respectively, by

$$\begin{aligned} \zeta ^{(s)}_{o,v}=\sum _{m=1}^{K-1}v^{(s)}_{m}\sum _{j=1}^{K-m}\left( \pi _{j(j+m)}+\pi _{(j+m)j}\right) =\sum _{m=1}^{K-1}v^{(s)}_{m}\nu _m \end{aligned}$$

and

$$\begin{aligned} \zeta ^{(s)}_{e,v}=\sum _{m=1}^{K-1}v^{(s)}_{m}\sum _{j=1}^{K-m}\left( \pi _{j.}\pi _{.(j+m)}+\pi _{(j+m).}\pi _{.j}\right) =\sum _{m=1}^{K-1}v^{(s)}_{m}\xi _m. \end{aligned}$$

The linear and quadratic disagreement weights are \(v^{(1)}_m=m\) and \(v^{(2)}_m=m^2\), respectively.

2 A New Eye on the Weighted Kappa Coefficients

Suppose that interest lies in the agreement level between two raters classifying items on a \(K\)-ordinal scale. Since agreement is often defined in terms of closeness between ratings (Stine, 1989; Warrens, 2014), the quantification of agreement levels is best based on the distance between the ratings. We define the distance between the two classifications as the number of categories separating the two raters’ classifications. Let the random variables \(Y_{ir}\) denote the classification of item \(i\) by rater \(r\) on the \(K\)-ordinal scale (\(i\in \mathcal {I},r=1,2\)), as defined in the previous section. These random variables follow a \(K\)-categorical distribution, i.e., \(Y_{i1}\sim \text{ cat }(\pi _{1.},\ldots ,\pi _{K.})\) and \(Y_{i2}\sim \text{ cat }(\pi _{.1},\ldots ,\pi _{.K})\). The random variable \(Z_i=|Y_{i1}-Y_{i2}|\) then denotes the number of categories separating the classifications made by the two raters. This random variable expresses the strength of disagreement between the two raters in the absolute sense (Stine, 1989). A value of 0 is associated with perfect agreement, while positive values represent disagreement. The larger the value, the stronger the disagreement. We have \(Z_i\sim \text{ cat }(\nu _0,\ldots ,\nu _{K-1})\), where \(\nu _0=\sum _{j=1}^{K}\pi _{jj}\) and \(\nu _m=\sum _{j=1}^{K-m}(\pi _{j(j+m)}+\pi _{(j+m)j})\) \((m=1,\ldots ,K-1)\). Under the chance assumption, \(Z_{i|\mathrm{ind}}\sim \text{ cat }(\xi _0,\ldots ,\xi _{K-1})\) with \(\xi _0=\sum _{j=1}^{K}\pi _{j.}\pi _{.j}\) and \(\xi _m=\sum _{j=1}^{K-m}(\pi _{j.}\pi _{.(j+m)}+\pi _{(j+m).}\pi _{.j})\) \((m=1,\ldots ,K-1)\).

While centered moments are most commonly used to describe the shape of statistical distributions, raw moments of the random variable \(Z_i\) have a particular meaning within the context of agreement since the category 0 corresponds to perfect agreement between the two raters. The shape of the distribution of the disagreement strength can therefore be summarized classically using the first two raw moments, namely the mean and the moment of inertia about 0, which are in fact the observed linear- and quadratic-weighted disagreements, since we have

$$\begin{aligned} E(Z_i)=\sum _{m=1}^{K-1}m\nu _m=\zeta ^{(1)}_{o,v} \quad \hbox { and } \quad E\left( Z_i^2\right) =\sum _{m=1}^{K-1}m^2\nu _m=\zeta ^{(2)}_{o,v} . \end{aligned}$$

More generally, the observed weighted disagreement of order \(s\) is the \(s\)th raw moment of the distribution of the distance between the two raters’ classifications, and a similar interpretation holds for the expected weighted disagreement of order \(s\) (\(s\in \mathbb {N}\)):

$$\begin{aligned} E(Z^{s}_i)=\sum _{m=1}^{K-1}m^s\nu _m=\zeta ^{(s)}_{o,v} \quad \text{ and } \quad E(Z^s_{i|\mathrm{ind}})=\sum _{m=1}^{K-1}m^s\xi _m=\zeta ^{(s)}_{e,v}. \end{aligned}$$
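The following sketch illustrates this reformulation: it derives the observed and chance distributions of the distance from a joint probability table and recovers the weighted kappa coefficients from their raw moments. The table is the same hypothetical one used in the previous sketch, so the printed values match those of `weighted_kappa`.

```python
import numpy as np

def distance_distributions(P):
    """Observed (nu) and chance (xi) distributions of Z = |Y1 - Y2|."""
    P = np.asarray(P, dtype=float)
    K = P.shape[0]
    E = np.outer(P.sum(axis=1), P.sum(axis=0))   # table expected under chance
    j, k = np.indices((K, K))
    D = np.abs(j - k)                            # categories apart, per cell
    nu = np.array([P[D == m].sum() for m in range(K)])
    xi = np.array([E[D == m].sum() for m in range(K)])
    return nu, xi

def raw_moment(probs, s):
    """s-th raw moment of a distribution supported on 0, ..., K-1."""
    m = np.arange(len(probs))
    return float(np.sum(m ** s * probs))

# Same hypothetical 3 x 3 table as in the previous sketch.
P = [[0.30, 0.05, 0.05],
     [0.05, 0.25, 0.05],
     [0.05, 0.05, 0.15]]
nu, xi = distance_distributions(P)
for s in (1, 2):
    zeta_o, zeta_e = raw_moment(nu, s), raw_moment(xi, s)
    # 1 - E(Z^s) / E(Z^s under independence) is the weighted kappa of order s
    print(s, round(1.0 - zeta_o / zeta_e, 3))
```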

2.1 Linear-Weighted Kappa Coefficient: A Position Parameter

The observed linear-weighted disagreement \(\zeta ^{(1)}_{o,v}\) is the first moment of the distribution of \(Z_i\), i.e., the mean distance (number of categories) between the classifications made by the two raters. In the same way, \(\zeta ^{(1)}_{e,v}\) is the mean distance expected by chance. The linear-weighted kappa therefore compares the mean distance between the classifications made by the two raters to the mean distance expected by chance and can thus be interpreted as the chance-corrected mean distance between the two classifications:

$$\begin{aligned} \kappa ^{(1)}_{v}=1-\frac{\text{ Mean } \text{ distance } \text{ between } \text{ the } \text{ two } \text{ classifications }}{\text{ Mean } \text{ distance } \text{ between } \text{ the } \text{ two } \text{ classifications } \text{ expected } \text{ by } \text{ chance }}. \end{aligned}$$

If \(\kappa ^{(1)}_{v}=x\), the observed mean distance between the two raters’ classifications is \((1-x)\) times the mean distance expected by chance. Perfect agreement (\(\kappa ^{(1)}_{v}=1\)) is obtained when the observed mean distance between the two classifications is null, i.e., there is no disagreement. A value of zero indicates that the observed mean distance equals the mean distance expected by chance, while negative values indicate that the observed mean distance is larger than the mean distance expected by chance.

2.2 Quadratic-Weighted Kappa Coefficient: A Concentration Parameter

The observed quadratic-weighted disagreement \(\zeta ^{(2)}_{o,v}\) is the second raw moment of the distribution of \(Z_i\), i.e., the moment of inertia of the distribution of the distance between the two raters’ classifications about the axis formed by the agreement cells. It therefore gives a measure of concentration (or variability) of the distance distribution around 0. In the same way, \(\zeta ^{(2)}_{e,v}\) corresponds to the moment of inertia expected by chance. The quadratic-weighted kappa therefore compares the observed moment of inertia (concentration) about 0 of the distance distribution between the two raters’ classifications to the moment of inertia (concentration) expected by chance. It can be interpreted as the chance-corrected moment of inertia about 0 of the distribution of the distance between the two raters’ classifications:

$$\begin{aligned} \kappa ^{(2)}_{v}=1-\frac{\text{ Moment } \text{ of } \text{ inertia } \text{ about } \text{0 } \text{ of } \text{ the } \text{ distance } \text{ between } \text{ the } \text{ two } \text{ classifications }}{\text{ Moment } \text{ of } \text{ inertia } \text{ about } \text{0 } \text{ expected } \text{ by } \text{ chance }}. \end{aligned}$$

If \(\kappa ^{(2)}_{v}=x\), the observed moment of inertia about 0 of the distance distribution between the two raters’ classifications is \((1-x)\) times the one expected by chance. Perfect agreement (\(\kappa ^{(2)}_{v}=1\)) means that the moment of inertia is 0, i.e., the distribution of the observations is concentrated in the agreement cells. A quadratic-weighted kappa of 0 means that the observed concentration equals the one expected by chance, while negative values indicate that the distribution of the distance between the two raters’ classifications is more dispersed than expected by chance.

3 Example

The contingency table given in Cohen (1968) is reproduced in Table A of Table 2 in terms of proportions. It summarizes the classification of patients by two psychiatrists into 3 diagnostic categories (\(1=\) personality disorder, \(2=\) neurosis, \(3=\) psychosis), ordered by the seriousness of the disease.

Table 2 \(3\times 3\) contingency table from the paper of Cohen (1968) (Table A) and contingency table with the same linear-weighted kappa coefficient (Table B) and quadratic-weighted kappa coefficient (Table C).

In Table A, the distance between the two raters’ classifications follows a 3-categorical distribution \(Z_{i}\sim \text{ cat }(0.70,0.20,0.10)\). Therefore, the probability of disagreement is equal to \({\hat{\zeta }}^{(0)}_{o,v}=0.30\), the mean distance between the two classifications is \({\hat{\zeta }}^{(1)}_{o,v}=0.40\) category, and the moment of inertia about the agreement axis is \({\hat{\zeta }}^{(2)}_{o,v}=0.60\). Under the chance assumption, the probability distribution becomes \(Z_{i|\mathrm{ind}}\sim \text{ cat }(0.41,0.42,0.17)\). This gives a probability of disagreement of \({\hat{\zeta }}^{(0)}_{e,v}=0.59\), a mean distance between the two classifications of \({\hat{\zeta }}^{(1)}_{e,v}=0.76\) category, and a moment of inertia about the agreement axis of \({\hat{\zeta }}^{(2)}_{e,v}=1.10\). Cohen’s kappa coefficient is then equal to \({\hat{\kappa }}^{(0)}_{v}=0.49\), the linear-weighted kappa to \({\hat{\kappa }}^{(1)}_{v}=0.47\), and the quadratic-weighted kappa to \({\hat{\kappa }}^{(2)}_{v}=0.45\). The conclusion is therefore that the observed probability of disagreement, equal to 0.30, is about half of the probability expected by chance. The mean distance, equal to 0.40 category, is 0.53 times the mean distance expected by chance. Finally, the moment of inertia about the agreement cells, equal to 0.60, is 0.55 times the moment of inertia expected by chance. As noted by Warrens (2013a), a quadratic-weighted kappa coefficient smaller than the linear-weighted kappa is seldom encountered in practice. It reflects that the gain in dispersion, with respect to the chance configuration, is smaller than the gain in location: a non-negligible part of the disagreements is likely located far from the agreement cells.
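These figures can be reproduced directly from the two distance distributions quoted above; the short script below uses only those values (the \(m=0\) term is excluded for \(s=0\), in line with the weights \(v^{(0)}\)).

```python
# Distance distributions of Table A as given in the text.
nu = [0.70, 0.20, 0.10]   # observed
xi = [0.41, 0.42, 0.17]   # under the chance assumption

for s in (0, 1, 2):
    zeta_o = sum(m ** s * p for m, p in enumerate(nu) if m > 0)
    zeta_e = sum(m ** s * p for m, p in enumerate(xi) if m > 0)
    print(s, round(zeta_o, 2), round(zeta_e, 2), round(1 - zeta_o / zeta_e, 2))
# s = 0: 0.30 vs 0.59 -> kappa = 0.49 (Cohen)
# s = 1: 0.40 vs 0.76 -> kappa = 0.47 (linear weights)
# s = 2: 0.60 vs 1.10 -> kappa = 0.45 (quadratic weights)
```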

Two hypothetical tables (Tables B and C in Table 2) were constructed to stress the fact that the linear- and quadratic-weighted kappa coefficients give complementary information on the disagreement distribution. To permit the comparison with Table A, the same marginal probability distributions were used for Tables B and C. The observed weighted disagreements and the weighted kappa coefficients corresponding to Tables A, B, and C are reported in Table 3. When comparing Tables A and B, the linear-weighted kappa coefficient is the same, while the quadratic-weighted kappa is higher in Table B than in Table A. This means that, despite the same mean distance between the two ratings in Tables A and B, the data are more concentrated along the agreement cells in Table B than in Table A. On the contrary, while Tables A and C show the same concentration level along the agreement cells, the mean distance between the two classifications is larger in Table C than in Table A. Reporting both the linear- and quadratic-weighted kappa coefficients in ordinal agreement studies will therefore describe the shape of the disagreement distribution better than reporting only one of the two coefficients. Moreover, reporting only the highest weighted kappa coefficient is arbitrary and should be discouraged.

Table 3 Cohen’s kappa, linear- and quadratic-weighted kappa coefficients for Table A, B and C.

4 Algebraic Relationships Between Kappa Coefficients

The linear- and quadratic-weighted disagreements are related through the variance of the distribution of \(Z_i\) by

$$\begin{aligned} \text{ var }(Z_i)=\zeta ^{(2)}_{o,v}-(\zeta ^{(1)}_{o,v})^2. \end{aligned}$$

This implies that \(\zeta ^{(1)}_{o,v}<(\zeta ^{(2)}_{o,v})^{\frac{1}{2}}\). This inequality is stronger than the inequality imposed by the classical definition of the observed weighted disagreements, i.e., \(\zeta ^{(1)}_{o,v}<\zeta ^{(2)}_{o,v}\). For instance, with the observed disagreements of Table A in Sect. 3, \(\text{ var }(Z_i)=0.60-0.40^2=0.44\) and \(0.40<\sqrt{0.60}\approx 0.77\). Unfortunately, this relationship cannot be transposed in terms of weighted kappa coefficients since both weighted coefficients are relative measures with respect to the chance assumption. Deviations from the values expected by chance can be larger or smaller for the mean than for the moment of inertia, depending on the configuration of the joint probability distribution table, as illustrated in Sect. 3. However, in the light of the new interpretation of the weighted kappa coefficients given in Sect. 2, it is possible to write a linear relationship between Cohen’s kappa and the power-weighted kappa coefficients. Beforehand, the linear relationship between the first \(K-1\) raw moments of a \(K\)-categorical variable is derived in Lemma 1.

Lemma 1

Let the random variable \(Z_i\) follow a \(K\)-categorical distribution \(Z_i\sim \text{ cat }(\nu _0,\ldots ,\nu _{K-1})\). Then, we have

$$\begin{aligned} \sum _{j=0}^{K-1}S\left( K,j+1\right) E\left( Z_i^{j}\right) =0, \end{aligned}$$

where \(S(K,j+1)\) are the signed Stirling numbers of the first kind (\(j=0,\ldots ,K-1\)) and where, consistently with the weights \(v^{(0)}\), the term of order \(j=0\) is understood as \(E(Z_i^0)=\sum _{t=1}^{K-1}\nu _t\) (convention \(0^0=0\), so that the agreement category \(Z_i=0\) does not contribute).

Proof

We have

$$\begin{aligned} \sum _{j=0}^{K-1}S(K,j+1)E(Z_i^{j})&= \sum _{j=0}^{K-1}S(K,j+1)\sum _{t=1}^{K-1}t^j\nu _t\\&= \sum _{t=1}^{K-1}\nu _t\sum _{j=0}^{K-1}S(K,j+1)t^j=\sum _{t=1}^{K-1}\frac{\nu _t}{t}\sum _{s=1}^{K}S(K,s)t^s.\\ \end{aligned}$$

By definition of the signed Stirling numbers, \(\sum _{s=1}^{K}S(K,s)t^s=t(t-1)\cdots (t-(K-1))\), a product that vanishes for every \(t=1,\ldots ,K-1\) since it contains the factor \(t-t\). Therefore,

$$\begin{aligned} \sum _{j=0}^{K-1}S(K,j+1)E\left( Z_i^{j}\right)&= \sum _{t=1}^{K-1}\frac{\nu _t}{t}t(t-1)\cdots (t-(K-1))=0. \end{aligned}$$

\(\square \)
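A numerical illustration of Lemma 1 is sketched below: the signed Stirling numbers of the first kind are computed with the standard recurrence \(S(n,k)=S(n-1,k-1)-(n-1)\,S(n-1,k)\), and the sum is evaluated for the observed distance distribution of Table A (Sect. 3), the \(j=0\) term being taken as \(\sum _{t\ge 1}t^0\nu _t\) as in the proof. The function names are chosen only for this illustration.

```python
def signed_stirling_first(n):
    """Row n of the signed Stirling numbers of the first kind:
    coefficients S(n, k) in x(x-1)...(x-n+1) = sum_k S(n, k) x^k."""
    row = [1]                                   # row for n = 0
    for m in range(1, n + 1):
        prev = row + [0]
        row = [(prev[k - 1] if k > 0 else 0) - (m - 1) * prev[k]
               for k in range(m + 1)]
    return row

def raw_moment_excluding_zero(nu, j):
    """sum_{t >= 1} t^j nu_t: the j-th raw moment with the agreement
    category (t = 0) excluded, as in the proof of Lemma 1."""
    return sum(t ** j * p for t, p in enumerate(nu) if t > 0)

# Observed distance distribution of Table A (Sect. 3), K = 3.
nu = [0.70, 0.20, 0.10]
K = len(nu)
S = signed_stirling_first(K)                    # S[k] = S(K, k); here [0, 2, -3, 1]
total = sum(S[j + 1] * raw_moment_excluding_zero(nu, j) for j in range(K))
print(abs(total) < 1e-12)                       # True: the sum vanishes
```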

Using this result, it will be shown in Theorem 1 that on a \(K\)-ordinal scale, Cohen’s kappa coefficient is a linear combination of the first \(K-1\) power-weighted kappa coefficients.

Theorem 1

Let \(\kappa ^{(s)}_{v}\) denote the weighted kappa coefficient of order \(s\) (\(s\in \mathbb {N}\)) obtained between two raters on a \(K\)-ordinal scale, i.e.,

$$\begin{aligned} \kappa ^{(s)}_{v}=1-\frac{\zeta ^{(s)}_{o,v}}{\zeta ^{(s)}_{e,v}}=1-\frac{E(Z^s_i)}{E(Z^s_{i|\mathrm{ind}})}, \end{aligned}$$

with \(Z_i\sim \text{ cat }(\nu _0,\ldots ,\nu _{K-1})\) and \(Z_{i|\mathrm{ind}}\sim \text{ cat }(\xi _0,\ldots ,\xi _{K-1})\), as defined in Sect. 2. We have

$$\begin{aligned} \kappa ^{(0)}_{v}=(-1)^{K}\sum _{j=1}^{K-1}S(K,j+1)\frac{\zeta ^{(j)}_{e,v}}{(K-1)!\zeta ^{(0)}_{e,v}}\kappa ^{(j)}_{v} \end{aligned}$$

where \(S(K,j+1)\) are signed Stirling numbers of the first kind (\(j=1,\ldots ,K-1\)).

In particular,

$$\begin{aligned} \kappa ^{(0)}_{v}&= \frac{3\zeta ^{(1)}_{e,v}}{2\zeta ^{(0)}_{e,v}}\kappa ^{(1)}_{v}-\frac{\zeta ^{(2)}_{e,v}}{2\zeta ^{(0)}_{e,v}}\kappa ^{(2)}_{v} \text{ in } 3\times 3 \text{ tables, }\\ \kappa ^{(0)}_{v}&= \frac{11\zeta ^{(1)}_{e,v}}{6\zeta ^{(0)}_{e,v}}\kappa ^{(1)}_{v}-\frac{6\zeta ^{(2)}_{e,v}}{6\zeta ^{(0)}_{e,v}}\kappa ^{(2)}_{v}+\frac{\zeta ^{(3)}_{e,v}}{6\zeta ^{(0)}_{e,v}}\kappa ^{(3)}_{v} \text{ in } 4\times 4 \text{ tables, } \text{ and }\\ \kappa ^{(0)}_{v}&= \frac{50 \zeta ^{(1)}_{e,v}}{24\zeta ^{(0)}_{e,v}}\kappa ^{(1)}_{v}-\frac{35\zeta ^{(2)}_{e,v}}{24 \zeta ^{(0)}_{e,v}}\kappa ^{(2)}_{v}+\frac{10\zeta ^{(3)}_{e,v}}{24\zeta ^{(0)}_{e,v}}\kappa ^{(3)}_{v}-\frac{\zeta ^{(4)}_{e,v}}{24\zeta ^{(0)}_{e,v}}\kappa ^{(4)}_{v} \text{ in } 5\times 5 \text{ tables. }\\ \end{aligned}$$

Proof

We have to prove that

$$\begin{aligned} (-1)^{K}(K-1)!\,\zeta ^{(0)}_{e,v}\left( 1-\frac{\zeta ^{(0)}_{o,v}}{\zeta ^{(0)}_{e,v}}\right) =\sum _{j=1}^{K-1}S(K,j+1)\,\zeta ^{(j)}_{e,v}\left( 1-\frac{\zeta ^{(j)}_{o,v}}{\zeta ^{(j)}_{e,v}}\right) , \end{aligned}$$

i.e. that

$$\begin{aligned} \sum _{j=0}^{K-1}S\left( K,j+1\right) \left[ E\left( Z_{i|\mathrm{ind}}^j\right) -E\left( Z_i^j\right) \right] =0. \end{aligned}$$

Since \(Z_{i|\mathrm{ind}}\) and \(Z_i\) both follow a \(K\)-categorical distribution, this follows directly from Lemma 1. \(\square \)

Therefore, conditionally on the value of the other members, there is a linear relationship between two members of the kappa-like family. The slope and the intercept of this linear relationship only depend on the marginal distribution of the two raters and the number of categories of the scale.
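As an illustration, the \(3\times 3\) relation can be verified with the values of Table A in Sect. 3 (\(\hat{\zeta }^{(0)}_{e,v}=0.59\), \(\hat{\zeta }^{(1)}_{e,v}=0.76\), \(\hat{\zeta }^{(2)}_{e,v}=1.10\), \(\hat{\kappa }^{(1)}_{v}=0.4737\) and \(\hat{\kappa }^{(2)}_{v}=0.4545\)):

$$\begin{aligned} \frac{3\hat{\zeta }^{(1)}_{e,v}}{2\hat{\zeta }^{(0)}_{e,v}}\hat{\kappa }^{(1)}_{v}-\frac{\hat{\zeta }^{(2)}_{e,v}}{2\hat{\zeta }^{(0)}_{e,v}}\hat{\kappa }^{(2)}_{v}=\frac{3\times 0.76\times 0.4737-1.10\times 0.4545}{2\times 0.59}=\frac{1.08-0.50}{1.18}=0.49=\hat{\kappa }^{(0)}_{v}. \end{aligned}$$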

5 Motivation in the Choice of a Kappa Coefficient

5.1 Crude or Chance-Corrected Agreement Coefficients

Several authors have suggested using crude measures of (dis)agreement, i.e., \(\zeta ^{(s)}_{o,v}\) or linear transforms of it (see Warrens (2012) for an overview), instead of kappa coefficients, their chance-corrected counterparts. The main argument for doing so is to avoid the dependence of the agreement coefficients on the marginal probability distributions of the raters. However, Rogot and Goldberg (1966) illustrated well on binary scales why crude agreement measures should not be used on their own. An extension of their argument is applied here to a 3-ordinal scale. Consider the two contingency tables resulting from the classification of 120 items by 2 raters on a 3-ordinal scale (see left and right tables in Table 4).

Table 4 Hypothetical classification of 120 items by two raters on a 3-ordinal scale (left and right tables) with the same crude disagreement (0.4 with linear weights and 0.6 with quadratic weights).

While the crude disagreement is equal for both tables (0.4 with linear weights and 0.6 with quadratic weights), whether agreement is equally good is highly questionable: there is indeed no agreement at all on categories 2 and 3 in the right table. The difference between the two cases emerges from differences in the marginal probability distributions of the raters, which the chance-corrected agreement measures take into account. The linear-weighted kappa coefficient is equal to 0.55 for the left table and 0.20 for the right table, while the quadratic-weighted kappa coefficient is equal to 0.55 and 0.27 for the left and right tables, respectively. As underlined by several authors (Vach, 2005; Kraemer, Vyjeyanthi, & Noda, 2004; Kottner et al., 2011), low kappa values principally indicate the inability of a scale to distinguish clearly between items of a population in which those distinctions are very rare or difficult to achieve. This is not a flaw of the kappa coefficients. It is therefore advised to complete the information given by crude agreement measures with chance-corrected measures.

5.2 Choice of the Weighting Scheme

Classically, Cohen’s kappa coefficient is used on nominal scales and weighted agreement coefficients on ordinal scales. Cohen’s kappa coefficient can, however, be used for ordinal scales when all disagreements are assumed to be equally important, for example in terms of consequences for the patient in diagnostic decision making.

When disagreements cannot be considered as having the same importance, reporting both the linear- and quadratic-weighted kappa coefficients will provide more information on the distribution of the disagreements than reporting one coefficient alone, as illustrated in Sect. 3. Indeed, as a general statistical principle, the use of a position and a variability parameter describes a distribution better than the use of one parameter alone. In particular, Lipsitz (1992) showed that the distribution of any \(K\)-ordinal random variable \(Z_i\sim \text{ cat }(\nu _0,\ldots ,\nu _{K-1})\) can alternatively be parametrized using its first \(K-1\) centered moments instead of the category probabilities \(\nu _m\) \((m=0,\ldots ,K-1)\). This means that reporting the linear- and quadratic-weighted kappa coefficients for 3-ordinal scales completely specifies the shape of the disagreement distribution, but that information will be lost for scales with more categories.

If only one coefficient has to be chosen, the linear-weighted kappa coefficient is advised because (1) a position parameter is the first choice to summarize a statistical distribution, (2) the interpretation of the linear-weighted kappa in terms of mean distance between the two raters’ classifications is very simple, (3) the quadratic-weighted kappa coefficient possesses unappealing mathematical properties (Yang & Chinchilli, 2011; Warrens, 2013c), and (4) the linear-weighted kappa coefficient (a position parameter) is less influenced by the choice of the number of categories of the scale than the quadratic-weighted kappa coefficient (a variability parameter) (Brenner & Kliebsch, 1996).

6 Discussion

Weighted kappa coefficients are commonly used to quantify agreement between two raters on \(K\)-ordinal scales. Two main criticisms are formulated against their use: (1) they are chance-corrected coefficients and (2) the weights are arbitrarily defined. In Sect. 5, we reiterate the arguments of Vach (2005), Kraemer et al. (2004), and Kottner et al. (2011) in favor of chance-corrected agreement coefficients rather than crude agreement coefficients. It is not a flaw of the kappa coefficients to take low values despite low observed disagreement: this principally indicates the inability of the scale to distinguish clearly between items of a population in which those distinctions are very rare or difficult to achieve.

In Sect. 2, we provide a rationale for the use of the linear and quadratic weights, the two weighting schemes most commonly used in practice. By defining the strength of disagreement as the number of categories separating the classifications made by the two raters, the linear- and quadratic-weighted kappa coefficients are, respectively, a position and a variability parameter of the distribution of this random variable, like the mean and the standard deviation in classical statistical problems. In particular, the linear-weighted kappa coefficient provides the change in the mean distance between the two raters’ classifications with respect to what is expected by chance, while the quadratic-weighted kappa coefficient provides the change in the moment of inertia about the agreement cells. The use of the linear and quadratic weighting schemes is therefore justified since statistical distributions are usually primarily described in terms of location and variability parameters. Both coefficients should ideally be reported since they provide complementary information on the distribution of the disagreements. If only one coefficient has to be reported, the use of the linear-weighted kappa coefficient is recommended, mainly because a probability distribution is first described in terms of location and because the quadratic-weighted kappa coefficient possesses unappealing mathematical properties (Yang & Chinchilli, 2011; Warrens, 2013c; Brenner & Kliebsch, 1996).

The new interpretation of the linear-weighted kappa coefficient in terms of mean distance between the two raters’ classifications has the advantage of being more practical than the interpretation initially proposed by Vanbelle and Albert (2009). On the other hand, the interpretation of the quadratic-weighted kappa coefficient in terms of moment of inertia offers two advantages over the intraclass correlation interpretation: (1) the interpretation is not asymptotic and (2) it avoids the problem of interpreting negative values.

While this paper focuses on weighted kappa coefficients, other agreement measures can be used depending on the definition of agreement adopted. Stine (1989) proposed to classify agreement obtained on metric scales into four classes, depending on the scale transformations allowed while maintaining perfect agreement. This classification was extended to ordinal scales by Warrens (2014) by replacing the metric scores by the category scores \(k=1,\ldots ,K\). The most common class of agreement is the absolute class, where two raters are said to be in agreement if they provide exactly the same classification of the items. Agreement coefficients for this class should therefore be sensitive to both location and variability differences in the two raters’ classifications. This is the case for both the power family of weighted kappa coefficients and the intraclass correlation coefficient of the absolute form (\(ICC(A,1)\) in McGraw and Wong (1996) or \(ICC(2,1)\) in Shrout and Fleiss (1979)). The use of the intraclass correlation coefficient is based on a reliability model, which has to be appropriate since inferences are based on the F-distribution. The reader is referred to Shrout and Fleiss (1979) and McGraw and Wong (1996) for more details on the different models. On the other hand, the use of the linear- and quadratic-weighted kappa coefficients only requires the assumption that the distance between the two classifications is ordinal.

The second class is the additive class. Two raters are said to perfectly agree even if their classifications differ by a constant number of categories, e.g., if there is a difference of \(a\) categories between the two raters’ classifications for all items (\(a\in \{0,\ldots ,K-1\}\)). Agreement coefficients for this class are therefore sensitive to variability differences in the two raters’ classifications but not to location differences, like the intraclass correlation coefficients of the consistency form (\(ICC(C,1)\) in McGraw and Wong (1996) and \(ICC(3,1)\) in Shrout and Fleiss (1979)). Here too, the use of the intraclass correlation coefficient is conditional on the appropriateness of the underlying reliability model.

The third class is the ratio class. Two raters are said to perfectly agree even if one rating is equal to \(b\) times the other rating (\(b\in \mathbb {R}\)). Agreement coefficients for this class are therefore sensitive to location differences but not to variability differences in the two classifications. Finally, the fourth class is the linear class, in which any positive linear transformation of the classifications is allowed. An agreement coefficient for this class is therefore not sensitive to differences in the location or in the variability of the classifications. This is the case for Spearman’s and Pearson’s correlation coefficients.

Note that, when the ordinal scale can be viewed as a categorization of an underlying unidimensional continuous variable with a normal distribution, the polychoric correlation coefficient, which gives the correlation between the two underlying continuous variables, can be used (Pearson, 1900). The choice of an appropriate agreement measure therefore depends on the choice of an appropriate agreement definition and on the suitability of the underlying mathematical assumptions for the agreement study at hand. We hope that the new interpretation provided in this paper will help researchers motivate their choice of weighting scheme in ordinal agreement studies if they choose to use weighted agreement coefficients.