
1 Introduction

Interrater reliability (IRR) refers to the degree of agreement, consistency, or shared variance among two or more raters assessing the same subjects, typically expressed as a number between 0 (no agreement) and 1 (perfect agreement). On September 27, 2017, the search term “inter-rater reliability” (including the quotation marks) returned 173,000 hits on Google Scholar, which illustrates its academic importance. IRR also has societal relevance. For example, in the Netherlands an officer of Child Protection Services (Raad voor de Kinderbescherming) assesses the recidivism risk, risk factors, and protective factors of each juvenile delinquent (Van der Put et al. 2011). The stakes for the juvenile delinquent are high because this assessment determines the district attorney’s sentencing recommendation. If the IRR of the assessment procedure were low, the sentencing recommendation would largely depend on which officer happened to do the assessment, which is highly undesirable.

In our experience, most researchers associate IRR with Cohen’s (1960) kappa, but an abundance of coefficients is available. For nominal data alone, Popping (1988) identified over 38 coefficients. Zhao et al. (2013) discussed 22 of these coefficients and found that several were mathematically equivalent, leaving 11 unique coefficients. The R package irr (Gamer et al. 2012) contains 17 different coefficients that estimate the IRR for various types of data. Some coefficients come in several versions, which increases the number of options even further. For example, the intraclass correlation coefficient (ICC) can be calculated under a one-way or a two-way model, to estimate the consistency or the absolute agreement of either a single rating or the average rating across raters. Given this abundance, we find it hard to justify a preference for any particular coefficient to estimate IRR. Despite review articles on IRR (e.g., Gwet 2014; Hallgren 2012), it remains unknown to what degree the estimated IRR depends on the choice of coefficient.
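To make the different ICC versions concrete, the following minimal sketch computes four of the possible variants on the Anxiety data that ships with the package; it assumes the function and argument names of irr version 0.84 and is an illustration rather than an exhaustive list of choices.

```r
# Four ICC variants, differing in model (one-way vs. two-way), type
# (consistency vs. absolute agreement), and unit (single vs. average rating).
library(irr)
data(anxiety)  # 20 subjects rated by 3 raters

icc(anxiety, model = "oneway",                       unit = "single")
icc(anxiety, model = "twoway", type = "consistency", unit = "single")
icc(anxiety, model = "twoway", type = "agreement",   unit = "single")
icc(anxiety, model = "twoway", type = "consistency", unit = "average")
```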

It would be desirable if coefficients that can be applied to data of the same measurement level (e.g., nominal data) produced similar results. Therefore, this paper investigates to what degree the choice of coefficient affects the estimated IRR. In the discussion, we attempt to explain some of the differences among coefficients and suggest the research that is needed to answer the question: “Which coefficient should a researcher use to estimate interrater reliability?”

2 Methods

2.1 Data

We selected four datasets that are freely available in the R package irr (see Table 1; Gamer et al. 2012). Each dataset contains the ratings of R raters observing S subjects. The dataset Diagnoses (Fleiss 1971) consists of ratings by six psychiatrists who classified 30 patients into one of five nominal diagnostic categories: depression, personality disorder, schizophrenia, neurosis, or other. The dataset Vision (Stuart 1953) consists of the distance-vision performance of 7477 subjects, measured for the left eye and the right eye; the two eyes are considered two instruments (i.e., two raters). These ratings were made on a scale from 1 (low performance) to 4 (high performance), which we treat as ordinal. The dataset Video is an artificial dataset in which four raters rated the credibility of 20 videotaped testimonies. Ratings could range from 1 (not credible) to 6 (highly credible), although observed scores only ranged from 2 to 5. Strictly speaking, rating scales cannot yield interval-level data unless the distances between adjacent scale points are known to be equal across the entire range of the scale; however, unbiased results may be obtained by treating Likert-type rating scales containing at least five points as interval-level rather than ordinal-level data (Rhemtulla et al. 2012). We therefore treated these ratings as interval-level data. The dataset Anxiety is also an artificial dataset, in which three raters rated the anxiety of 20 subjects on a scale from 1 (not anxious at all) to 6 (extremely anxious). These ratings were also treated as interval-level data.
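The four datasets can be loaded directly from the package; the short sketch below uses the object names distributed with irr version 0.84, with the dimensions as described above.

```r
library(irr)

data(diagnoses)  # 30 patients x 6 psychiatrists, nominal diagnostic categories
data(vision)     # 7477 subjects, left- and right-eye distance vision (ordinal, 1-4)
data(video)      # 20 videotaped testimonies x 4 raters (1-6 scale, treated as interval)
data(anxiety)    # 20 subjects x 3 raters (1-6 scale, treated as interval)

str(diagnoses)   # inspect the structure of one of the datasets
```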

Table 1 Characteristics of the four datasets

2.2 IRR Coefficients

We considered 20 IRR coefficients from the R package irr (version 0.84; Gamer et al. 2012). We considered nine coefficients for nominal ratings (Table 2, top panel). Cohen’s kappa (\( \kappa \); Cohen 1960) can be used only for nominal ratings by two raters. Weighted versions of \( \kappa \), which are likewise restricted to two raters, have been derived (Cohen 1968); the weights reflect the amount of disagreement between the raters. We calculated two weighted versions: \( \kappa \) with equal weights (\( \kappa_{W} \)) and \( \kappa \) with squared weights (\( \kappa_{W^{2}} \)). Three generalizations of \( \kappa \) were available for nominal data with more than two raters: Fleiss’ kappa (\( \kappa_{\text{Fleiss}} \); Fleiss 1971), Conger’s exact kappa (\( \kappa_{\text{Exact}} \); Conger 1980), and Light’s kappa (\( \kappa_{\text{Light}} \); Light 1971). The percent agreement, Krippendorff’s (1980) alpha, and coefficient iota (Janson and Olson 2001) each have a version for several measurement levels, including nominal-level ratings; their nominal versions are denoted \( PA_{N} \), \( \alpha_{N} \), and \( \iota_{N} \), respectively.
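For illustration, these nominal-level coefficients can be obtained as sketched below, assuming the function names and argument conventions of irr 0.84; this is a sketch, not the exact script behind Table 3. Because \( \kappa \) and its weighted versions require exactly two raters, they are shown here for the first two psychiatrists of the Diagnoses data only.

```r
library(irr)
data(diagnoses)

# Cohen's kappa and its weighted versions: two raters only, so use two columns
kappa2(diagnoses[, 1:2], weight = "unweighted")  # Cohen's kappa
kappa2(diagnoses[, 1:2], weight = "equal")       # weighted kappa, equal weights
kappa2(diagnoses[, 1:2], weight = "squared")     # weighted kappa, squared weights

kappam.fleiss(diagnoses)                 # Fleiss' kappa
kappam.fleiss(diagnoses, exact = TRUE)   # Conger's exact kappa
kappam.light(diagnoses)                  # Light's kappa

agree(diagnoses)                         # percent agreement, reported on a 0-100 scale

# Krippendorff's alpha expects raters in rows; recode all categories with one
# common numeric coding before transposing
codes <- matrix(as.integer(factor(as.matrix(diagnoses))), nrow = nrow(diagnoses))
kripp.alpha(t(codes), method = "nominal")

iota(list(diagnoses), scaledata = "nominal")   # coefficient iota, nominal version
```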

Table 2 Characteristics of the 20 IRR coefficients used in this study

We considered four coefficients for ordinal ratings (Table 2, central panel). Kendall’s (1948) W and the mean of Spearman’s rank-order correlation (\( \bar{\rho } \); Spearman 1904) have been designed specifically for ordinal data, whereas the percent agreement and Krippendorff’s (1980) alpha have a version for ordinal ratings. The latter two coefficients are denoted \( PA_{O} \) and \( \alpha_{O} \), respectively.
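A corresponding sketch for the ordinal-level coefficients, again assuming the conventions of irr 0.84 and assuming that the Vision ratings are stored as ordered categories that can be coded 1 to 4:

```r
library(irr)
data(vision)

vis <- sapply(vision, as.numeric)        # code the ordered 1-4 categories numerically

kendall(vis, correct = TRUE)             # Kendall's W, corrected for ties
meanrho(vis)                             # mean Spearman rank-order correlation
agree(vis)                               # percent agreement, on a 0-100 scale
kripp.alpha(t(vis), method = "ordinal")  # Krippendorff's alpha, ordinal version
```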

We considered seven coefficients for interval-level ratings (Table 2, bottom panel); each of these coefficients can also be applied to ratio-level ratings. The percent agreement, Krippendorff’s (1980) alpha, and coefficient iota (Janson and Olson 2001) have a version for interval ratings, denoted \( PA_{I} \), \( \alpha_{I} \), and \( \iota_{I} \), respectively. For the Finn (1970) coefficient and the ICC (Shrout and Fleiss 1979), we specified two-way models, which treat both raters and subjects as randomly drawn from a population, as is often the case in social and behavioral research. In addition, for the ICC we computed the level of consistency rather than the level of absolute agreement. Furthermore, we computed the mean of Pearson’s product-moment correlation coefficients (\( \bar{r} \); Pearson 1895) and Robinson’s measure of agreement (A; Robinson 1957).
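The interval-level coefficients can be obtained in the same way; the sketch below applies them to the Video data (irr 0.84 conventions, 6-point scale, two-way models as specified above), and the Anxiety data can be substituted directly.

```r
library(irr)
data(video)

agree(video)                                            # percent agreement, 0-100 scale
kripp.alpha(t(as.matrix(video)), method = "interval")   # Krippendorff's alpha, interval version
iota(list(video), scaledata = "quantitative")           # coefficient iota, interval version

finn(video, s.levels = 6, model = "twoway")             # Finn coefficient, two-way model
icc(video, model = "twoway", type = "consistency",
    unit = "single")                                    # ICC, two-way model, consistency
meancor(video)                                          # mean Pearson product-moment correlation
robinson(video)                                         # Robinson's A
```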

We excluded three coefficients of the R package irr from our analyses because they clearly measure something other than the IRR: the Stuart-Maxwell coefficient (Maxwell 1970) and the Bhapkar (1966) coefficient assess homogeneity of the marginal distributions, and the coefficient of Eliasziw et al. (1994) estimates intrarater reliability (i.e., the consistency of repeated ratings by the same rater).

2.3 Analyses

For the nominal dataset (Diagnoses), we applied only the nominal IRR coefficients. For the ordinal dataset (Vision), we applied all nominal, ordinal, and interval-level IRR coefficients, with the exception of \( \alpha_{N} \) and \( \alpha_{I} \). The results of interval-level coefficients are of interest because researchers frequently treat Likert-type scales as though they are continuous, and the results of nominal coefficients are of interest when the ordering of the categories is not of primary interest in the application at hand. For the same reasons, for the interval-level datasets (Video and Anxiety) we also computed all nominal, ordinal, and interval-level IRR coefficients, with the exception of \( PA_{N} , PA_{O} , \alpha_{N} , \alpha_{O} , \) and \( \iota_{N} \).

We investigated the range of values obtained by these coefficients. We also investigated whether the choice of coefficient affects the conclusion about the IRR, using the heuristic labels suggested by Landis and Koch (1977) for the interpretation of \( \kappa \): negative values indicate a poor IRR; values between 0 and 0.20 indicate a slight IRR; values between 0.21 and 0.40 indicate a fair IRR; values between 0.41 and 0.60 indicate a moderate IRR; values between 0.61 and 0.80 indicate a substantial IRR; and values between 0.81 and 1.00 indicate an almost perfect IRR.
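As an illustration of how these labels can be assigned programmatically, the following helper function (our own, hypothetical, and not part of the irr package) maps coefficient values reported to two decimals onto the Landis and Koch labels.

```r
# Assign the Landis and Koch (1977) labels to coefficient values (two decimals).
landis_koch <- function(x) {
  cut(x,
      breaks = c(-Inf, 0, 0.21, 0.41, 0.61, 0.81, Inf),
      labels = c("poor", "slight", "fair", "moderate", "substantial", "almost perfect"),
      right  = FALSE)
}

landis_koch(c(-0.04, 0.17, 0.46, 0.85))
# expected labels: poor, slight, moderate, almost perfect
```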

Furthermore, by checking the literature and the functions of the package irr, we investigated the following aspects of the IRR coefficients in Table 2: Are standard errors available? Is it possible to conduct null-hypothesis significance testing? Are missing data allowed and, if so, how are they handled? How many raters are allowed?

3 Results

Table 3 shows the variability of the evaluated IRR coefficients as estimated for the four datasets. For the nominal-level dataset Diagnoses, the six available IRR coefficients ranged from 0.17 (\( PA_{N} \)) to 0.46 (\( \kappa_{\text{Light}} \); M = 0.40, SD = 0.11). For the ordinal-level dataset Vision, the IRR coefficients ranged from 0.60 (several coefficients) to 0.85 (W; M = 0.69, SD = 0.09), but from 0.71 (several coefficients) to 0.85 if only the ordinal IRR coefficients are considered. For the interval-level dataset Video, the IRR coefficients ranged from 0.04 (\( \kappa_{\text{Fleiss}} \)) to 0.92 (Finn; M = 0.26, SD = 0.24), but from 0.10 (\( \alpha_{I} \)) to 0.92 if only the interval-level IRR coefficients are considered. For the interval-level dataset Anxiety, the IRR coefficients ranged from −0.04 (\( \kappa_{\text{Fleiss}} \)) to 0.54 (W; M = 0.22, SD = 0.21), but from 0.00 (\( PA_{I} \)) to 0.50 (Finn) if only the interval-level IRR coefficients are considered.

Table 3 IRR estimates for 20 coefficients on 4 datasets

Table 3 (see the asterisks next to the values) also shows that the interpretation of the IRR of a dataset by means of the benchmarks of Landis and Koch (1977) depends on the choice of coefficient. For the dataset Diagnoses, the IRR could be labeled slight, fair, or moderate; for the dataset Vision, the IRR could be labeled moderate, substantial, or almost perfect; for the dataset Video, the IRR could be labeled anywhere from slight to almost perfect; and for the dataset Anxiety, the IRR could be labeled poor, slight, fair, or moderate.

Standard errors are available for 13 of the 20 coefficients (Table 2); to the best of our knowledge, they are not available for the remaining coefficients. For nine coefficients, a test statistic is available for testing whether the coefficient equals zero.
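For the coefficients that do provide a test statistic, the objects returned by the irr functions expose it directly; the sketch below assumes the component names used by irr 0.84 and uses Fleiss’ kappa on the Anxiety data as an example. Not every coefficient returns such a statistic (cf. Table 2).

```r
library(irr)
data(anxiety)

fit <- kappam.fleiss(anxiety)  # Fleiss' kappa, treating the ratings as categories

fit$value      # the estimated coefficient
fit$statistic  # test statistic (here a z value) for H0: the coefficient equals zero
fit$p.value    # the corresponding p value
```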

Although no dataset contained missing values, it is worth noting that the package irr handles missing data differently for different coefficients. Coefficients \( \alpha_{N} \), \( \alpha_{O} \), and \( \alpha_{I} \) use all available data by counting disagreements among any observed pair of ratings on the same subject (i.e., pairwise deletion). Coefficients \( \iota_{N} \) and \( \iota_{I} \) do not allow missing ratings (i.e., the software will return a missing value for the coefficient when any ratings are missing), whereas all other coefficients handle missing data by listwise deletion.
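The sketch below illustrates these three behaviours (assuming irr 0.84) by deleting a single rating from the Anxiety data.

```r
library(irr)
data(anxiety)

anx_miss <- anxiety
anx_miss[1, 1] <- NA   # delete one rating: subject 1, rater 1

# Krippendorff's alpha still uses all observed pairs of ratings (pairwise deletion)
kripp.alpha(t(as.matrix(anx_miss)), method = "interval")

# iota returns a missing value as soon as any rating is missing
iota(list(anx_miss), scaledata = "quantitative")

# most other coefficients, such as the ICC, drop subject 1 entirely (listwise deletion)
icc(anx_miss, model = "twoway", type = "consistency", unit = "single")
```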

4 Discussion

The results showed that the coefficients can yield very different numerical values when applied to the same dataset. Depending on the choice of coefficient, the IRR label for a single dataset can range from poor to almost perfect. This calls the usefulness of IRR coefficients seriously into question. Because we limited ourselves to the coefficients available in the R package irr (Gamer et al. 2012), the ranges might be even wider if more coefficients were included. This problem should be investigated further.

The usefulness of the coefficients in this paper can be investigated only if IRR has a sound definition; however, a clear definition seems to be absent. Some coefficients (e.g., the ICC) are based on variance decomposition, which is compatible with the framework of generalizability theory (e.g., Vangeneugden et al. 2005), whereas other coefficients (e.g., \( PA \)) are derived from the concept of literal agreement. Coefficients that stem from different conceptualizations of IRR cannot all measure the same thing. In a recent discussion with Feng (2015), Krippendorff (2016) wrote: “I contend Feng discusses reliability measures with seriously mistaken conceptions of what reliability is to assure us of” (p. 139). We need to distinguish the different theories behind the IRR coefficients and develop a more accurate terminology that identifies the competing conceptualizations of IRR. Only when the theories and models behind IRR have been sorted out can we start investigating why some IRR coefficients produce higher values than others, and separate the wheat from the chaff. In that respect, we believe the work of Zhao et al. (2013) is a valuable contribution; they explain, for example, the flaws of chance-corrected coefficients such as \( \kappa \). Once we have selected estimators for the different conceptualizations of IRR, we can deal with the other issues identified in this study.

Another major problem is that few coefficients can handle missing data. This matters because ratings in the social and behavioral sciences can be expensive to obtain. For example, an assessment of a juvenile delinquent by an officer of Child Protection Services in the Netherlands (see the Introduction) takes approximately 6–8 h. A study investigating the IRR of such assessments must allow for planned missingness, because it is financially and practically impossible to have all officers assess all juvenile delinquents. Hence, a useful coefficient must be estimable in the presence of missing data.

We also found that for some coefficients, standard errors and confidence intervals cannot be computed and null-hypothesis testing is impossible. These standard errors, confidence intervals, and hypothesis tests should first be derived. Then the bias of all standard errors, the coverage of all confidence intervals, and the Type I error rate of all hypothesis tests should be investigated.

Finally, we used the benchmarks of Landis and Koch (1977), which are considered the single most often used benchmarks (e.g., Gwet 2014, p. 164); the 42,000+ Google Scholar citations of the Landis and Koch paper attest to their widespread use, at the very least. A relevant question is whether these benchmarks, which were designed for \( \kappa \), can be applied to coefficients stemming from different conceptualizations of IRR. Future research should investigate whether different sets of heuristic rules should be provided for different types of coefficients.