Introduction

Scientific progress is a fundamentally social process. Research not only builds on the work, ideas, and findings of others (Hardwig 1985), but it also benefits in particular from different viewpoints and strategies (Kitcher 1990), as well as from empirical investigation and critical discussion (Popper 1968). Peer review has introduced this social element into the publication process (Fiske and Fogg 1990): a researcher's work is examined and discussed by colleagues before it is made accessible to the whole scientific community. Today, most scientific disciplines rely on this kind of quality control (Hemlin and Rasmussen 2006). The underlying rationale of this procedure is to avoid errors and to ensure a certain quality of publications (Bailar and Patterson 1985; Church et al. 1996; Cornforth 1974). To meet this responsibility, a certain shared understanding of what characterizes quality seems necessary. Much research has already examined inter-rater reliability, also called inter-rater agreement, and has painted a rather pessimistic picture.

In an effort to expand upon prior research, the paper presented here has the following three major research goals: First, we investigate whether prior findings of poor inter-rater reliability are generalizable to the interdisciplinary context. Our second major goal is to discuss and apply adequate methods for ill-structured measurement designs and thereby take reviewer discipline into account. The third major goal is to examine the underlying structure of the ratings of specific paper characteristics and to explore their potential to predict the citation rate of published papers.

To accomplish these goals, the paper is structured as follows: First, we outline previous research on inter-rater reliability. Second, we turn to methods of analysis used in prior research, with their limitations and possible solutions. Third, we summarize research on dimensionality and construct validity of peer-review ratings. In addition, we also consider prior research into criterion validity with regard to the predictability of the citation rate by reviewer recommendations. Then, we describe the data on which our analyses are based. Finally, we report our results, which we subsequently discuss in the light of previous research and practical implications.

Inter-rater reliability in interdisciplinary peer-review processes

To date, numerous studies have examined inter-rater reliability among reviews of different scientific products, such as abstracts (e.g., Blackburn and Hakel 2006; Cicchetti and Conn 1976; Rubin et al. 1993), grant proposals (e.g., Cole et al. 1981; Jayasinghe et al. 2003; Marsh et al. 2007) and scientific papers (e.g., Gottfredson 1978; Marsh and Ball 1981; Wood et al. 2004), as well as applications to scholarships (Bornmann and Daniel 2005). Research clearly indicates that evaluations of the same scientific manuscript differ substantially among reviewers. That is, the level of inter-rater reliability is quite low (Bornmann et al. 2010; Campanario 1998; Cicchetti 1991; Lindsey 1988). Accordingly, and consistent with previous literature reviews (Cicchetti 1991; O’Brien 1991), a recent meta-analysis based on 48 studies (and 70 reliability coefficients) concluded that inter-rater reliability “is quite limited and needs improvement” (mean ICC = .34, mean Cohen’s Kappa = .17; Bornmann et al. 2010, p. 1; for a critical view on the Kappa coefficient, see Baethge et al. 2013; see also footnote 1).

Although different scientific disciplines have been investigated (e.g., medicine, psychology, and sociology) and partially compared with each other (e.g., Kemper et al. 1996; Mutz et al. 2012), research on inter-rater reliability in an interdisciplinary field is scarce. One notable exception was the study by Langfeldt (2001). This study, however, applied mainly qualitative methods and did not provide quantitative indices of inter-rater reliability. For some other studies, it is not entirely clear whether the analyses involved different disciplines (e.g., Herzog et al. 2005; Marsh et al. 2007). In any case, none of them has examined whether a match or a mismatch between reviewer discipline and paper discipline mattered with regard to inter-rater reliability. This is, however, highly relevant given the rapid increase in interdisciplinary research (Qiu 1992; van Noorden 2015). The corresponding challenge is that different disciplines may utilize different methodological approaches (e.g., quantitative vs. qualitative methods; hypothesis-guided experiments vs. explorative descriptions; Platt 1964). Ultimately, such different approaches may be reflected in different standards for the evaluation of scientific contributions. Therefore, it was the first main objective of the present paper to investigate inter-rater reliability in an interdisciplinary scientific context. To this end, we analyzed proceedings submitted to an international interdisciplinary conference at the interface of social sciences (e.g., education), natural sciences (e.g., psychology) and technological sciences (e.g., information technology; see footnote 2).

Methods for assessing inter-rater reliability

Complex data structures are typical for the analysis of inter-rater reliability in peer-review contexts. The second major goal of our study was, therefore, to discuss and apply an adequate method for dealing with such highly complex data structures. Prior research has mainly relied on the classical multitrait-multimethod (MTMM) approach by Campbell and Fiske (1959). In such designs, each target needs to be measured by each method. For paper reviews, this means that each submission would have to be rated by every reviewer, so that papers and reviewers were fully crossed (see Putka et al. 2011). Such fully crossed designs, however, are rare in peer-review contexts.

Studies in which submissions to scientific journals have been analyzed are common (e.g., Bornmann and Daniel 2008b; Howard and Wilkinson 1998; Kirk and Franke 1997; Petty et al. 1999; Scarr and Weber 1978; Scott 1974). In such scenarios, editors typically prefer reviewers that are experts in the field of a submitted manuscript. As a consequence, the overall design is far from a fully-crossed design. Nested designs, in which “each target is rated by a unique, non-overlapping set of raters” (Putka et al. 2008, p. 960) might come closer to reality. However, even this is more an exception than the rule. In most cases, some reviewers evaluate more than one submission. This is especially true for conferences where a limited number of reviewers typically evaluate a subset of all submissions (e.g., Rubin et al. 1993). Thus, many practical scenarios only provide ill-structured measurement designs (ISMDs) in which “ratees and raters are neither fully crossed nor nested” (Putka et al. 2008, p. 960). Similar problems associated with ISMD are also known with regard to traditional nominal scale agreement coefficients (Cohen 1960; Fleiss 1971; see also Baethge et al. 2013; Uebersax 1982–1983).

Some previous studies have tried to solve the ISMD problem by randomly selecting a certain number of reviewers per target and “arbitrarily identifying those raters as ‘Rater 1’ and/or ‘Rater 2’ for each ratee” (Putka et al. 2011, p. 506; e.g., see Marsh and Ball 1989, p. 157; Petty et al. 1999, p. 192). This procedure is associated with various problems, however, such as possible identification problems, inappropriate solutions, and data loss (e.g., see Brown 2015; Eid 2000). Moreover, it raises the question about the meaning of “Rater i” and “Rater j” in such models. Most critical, however, is that researchers can subsequently come to different findings and conclusions simply as a result of differences in the rater selection and assignment (Putka et al. 2011).

There are also drawbacks to more traditional estimators of inter-rater reliability when it comes to ISMD scenarios. Putka et al. (2008) showed that both Pearson correlation approaches and conventional intraclass correlation coefficients (ICCs; e.g., McGraw and Wong 1996; Shrout and Fleiss 1979) may systematically underestimate inter-rater reliability in ISMD scenarios, with the magnitude of this bias depending on the specific design conditions (Putka et al. 2008).

With reference to generalizability (G) theory (e.g., Brennan 2001), and as an alternative to the strategies described above, Putka et al. (2008) offer the G-coefficient G(q, k) as a “new interrater estimator that can be used regardless of whether one’s design is crossed, nested, or ill-structured” (p. 977). Parameter k is the number of reviewers per paper; in ISMD scenarios, k can be estimated by the harmonic mean (HM) of the number of raters per ratee. Parameter q scales the variance proportion that is related to the rater main effects (Putka et al. 2008, p. 963). In fully nested designs, q equals 1/k; in ISMD scenarios it is always smaller than 1/k; and in fully crossed designs it equals zero. With regard to the average ICCs (see Shrout and Fleiss 1979; McGraw and Wong 1996), G(q = 1/k, k) equals ICC(1, k) and G(q = 0, k) equals ICC(C, k). All of these coefficients estimate the reliability of target scores that are derived by aggregating the ratings of k raters.
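To make the role of the two parameters concrete, the general form of the estimator can be sketched as follows. This is our paraphrase of Putka et al. (2008), with the sigma-hat terms denoting the estimated variance components for the paper (target) main effect, the reviewer (rater) main effect, and the combined interaction/error term, so the notation may deviate from the original.

```latex
% Sketch of the G(q, k) estimator (our paraphrase of Putka et al. 2008)
\[
  G(q, k) \;=\;
  \frac{\hat{\sigma}^{2}_{P}}
       {\hat{\sigma}^{2}_{P} \;+\; q\,\hat{\sigma}^{2}_{R} \;+\; \hat{\sigma}^{2}_{PR,e}/k}
\]
% \hat{\sigma}^{2}_{P}: paper main effect; \hat{\sigma}^{2}_{R}: reviewer main
% effect; \hat{\sigma}^{2}_{PR,e}: paper x reviewer interaction plus error.
% q = 1/k reproduces ICC(1, k) and q = 0 reproduces ICC(C, k); in ISMDs q lies
% between these bounds and k is the harmonic mean of raters per ratee.
```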

Analogously, setting k to one in the equations described by Putka et al. (2008, p. 963) yields G(q, 1), an estimate of single-rater reliability (D. J. Putka, personal communication, December 15, 2015). Single-rater coefficients refer to the reliability of a target score that is derived from only one rater. Such single-rater ICCs were used in Bornmann et al.’s (2010) meta-analysis of inter-rater reliability in journal peer review. Usually, a single-rater reliability should be smaller than G(q, k ≥ 2). This is analogous to the phenomenon that adding items to a test can, if certain assumptions are met, improve its reliability (e.g., Raykov and Marcoulides 2011; Wirtz and Caspar 2002; Yousfi 2005). Putka et al. (2008) recommend, however, that the coefficients G(q, 1) and G(q, k ≥ 2) be estimated separately, that is, without using the Spearman-Brown prophecy formula (see footnote 3).

Based on a Monte Carlo simulation, Putka et al. (2008) conclude that “traditional estimators are either inappropriate or do not provide the most accurate result” (p. 980). They recommend the G-coefficient as “an attractive option relative to traditional methods” (Putka et al. 2008, p. 978). In our paper, we therefore made use of this method, which is well suited to our ill-structured data set (see below).

Validity of peer-review ratings

Despite the fact that several studies have assessed inter-rater reliability for various measures of paper quality (e.g., originality and relevance), little research has addressed the dimensionality of the quality judgements themselves. This is important with regard to two major aspects. On the one hand, there is the question of which paper characteristics should be considered in peer-review processes. On the other hand, there is the issue of whether reviewers are able to differentiate among different aspects of paper quality or whether they are instead driven by an overall, general impression (e.g., a halo effect, Thorndike 1920; see also Pulakos et al. 1986). Despite the practical relevance and the fact that many journals (and conferences) provide multiple rating dimensions for peer-review evaluations, the issue of dimensionality itself has received much less scientific attention than inter-rater reliability (Marsh et al. 2008).

When a scientific contribution is evaluated in terms of quality, the question arises as to how quality can be defined. This becomes even more important in interdisciplinary contexts, as scientific disciplines may differ substantially in what they value. At the same time, it becomes relevant whether the instruments that are employed (e.g., rating scales) indeed measure the hypothetical construct that was targeted (e.g., originality). It is also relevant whether multiple measures actually assess different constructs or whether these overlap to an extent that they could be subsumed under the same label. This is the matter of construct validity (Messick 1995; Strauss and Smith 2009), which is the “overarching principle of validity, referring to the extent to which a psychological measure in fact measures the concept it purports to measure” (Brown 2015, p. 187).

Two concepts of construct validity are important in this context: convergent validity and discriminant validity (Campbell and Fiske 1959; see also Brown 2015). Convergent validity means that measures (different methods or indicators) of the same construct should be highly interrelated. Discriminant (or divergent) validity means that measures of different constructs should not be interrelated. Hence, an important question is whether the quality of a scientific contribution is a unidimensional construct that can be summarized in one global evaluation score. The alternative view would argue that quality comprises multiple dimensions which should be considered separately. In other words, if reviewers are asked to rate the quality of a scientific contribution on various dimensions (e.g., relevance, soundness, and novelty), the question is, whether these dimensions indeed represent distinct constructs, which would suggest a multidimensional structure, or whether they all converge, suggesting a unidimensional structure.

Naturally, the answer to this question depends upon the rating dimensions employed. Unfortunately, to date there is hardly any consensus on the dimensions on which a review should be based (for some elaborations, see Chase 1970; Hemlin and Montgomery 1990). Cicchetti (1991) identified two aspects that are broadly accepted: the importance of the study to the field and the perceived adequacy of the research design. Journals, conferences, and funding agencies, however, often ask their reviewers to evaluate papers on many more rating dimensions, as can be seen in the studies on inter-rater agreement (e.g., Cicchetti and Conn 1976; Marsh and Ball 1989; Montgomery et al. 2002; Rubin et al. 1993; Scott 1974; Whitehurst 1983).

Unfortunately, however, only one of these studies has examined the dimensional structure of these ratings as well as other aspects of construct validity. Marsh and Ball (1989) found modest support for the distinctiveness of the four rating dimensions that they had extracted from a 21-item instrument (research methods, relevance to readers, writing style and presentation clarity, and significance/importance). Their analysis favored the multidimensional model over an alternative unidimensional model. Similarly, Petty et al. (1999) reported a better fit when a model was based on five dimensions (literature, theory, methodology, importance, and recommendation) compared to an alternative unidimensional model. On the other hand, Cicchetti and Conn (1976) found that certain single dimensions (originality, design-execution, importance, and overall scientific merit) correlated strongly with an overall score (.55–.96). However, they did not directly compare a multidimensional model with a unidimensional one.

It is clear there is still little evidence with regard to the question of whether the use of multiple dimensions in fact adds something unique to a general evaluation. Moreover, there is no evidence at all when it comes to the interdisciplinary context. In our study, we have addressed this gap and analyzed reviews of submissions to an interdisciplinary conference. Here, reviewers provided both an overall evaluation and ratings with respect to four specific rating dimensions (relevance, novelty, significance, and soundness). These dimensions have been employed or suggested in previous publications as well (e.g., Beyer et al. 1995; Campion 1993; Cicchetti and Conn 1976; Gilliland and Cortina 1997; Gottfredson 1978).

Another way of investigating the validity of the peer-review ratings is to look at the potential of the rating dimensions to predict the citation rate of the accepted papers. For example, Bornmann and Daniel (2008a) showed that papers accepted by a high-impact chemistry journal received more citations than papers that were rejected and published elsewhere. Opthof et al. (2002) showed a positive and significant relationship between reviewers’ priority recommendations and papers’ citation counts for three years after publication in a cardiology journal. A positive relationship between peer-review scores and citation rates also exists in the field of research project grants (Li and Agha 2015). However, for another medical journal, Baethge et al. (2013) did not find a significant relationship between reviewer recommendations and citation rate. A possible explanation could be that “accepted versions of manuscripts differ considerably from submitted versions” (Baethge et al. 2013, p. 6). In any case, it seems worth investigating the predictive criterion validity of different rating dimensions in an interdisciplinary peer-review context. For this investigation, we considered the papers that had been published in the conference proceedings (see footnote 2).

Taken together, our paper has the following three major goals: (1) to analyze inter-rater reliability in an interdisciplinary context, across all paper-reviewer combinations and separately for same-discipline versus different-discipline reviewers, (2) to apply state-of-the-art methods of analysis that account for the ill-structured measurement design, and (3) to examine the dimensionality and validity of the different rating dimensions.

Methods

Our study analyzed the reviews of papers submitted to an international interdisciplinary conference (see footnote 2). The conference takes place annually and is interdisciplinary, attracting researchers from computer science, education, psychology, and communication science. It is of medium size (about 200–300 participants, about 100–200 submissions) and has an acceptance rate of less than 30%. It is thus a competitive, typically mid-sized conference with an interdisciplinary topic. Accepted papers appear in a Springer book series (see footnote 2).

Papers and reviewers

A total of 174 submissions (including keynotes) were listed in the conference system. For our analyses, we considered only those n = 145 submissions that had been rated by at least two reviewers. Of these, n_ap = 82 accepted submissions were later published in the conference proceedings (see footnote 2). Overall, m = 130 reviewers reviewed the n = 145 submissions, resulting in a total of v = 443 reviews.

Because reviewers could opt for the papers they would like to review, the number of reviewers per paper varied. Each of these papers received, on average, M = 3.06 reviews (SD = 0.40), with a minimum of 2 and a maximum of 5 reviews. Each of the corresponding reviewers (m = 130) reviewed, on average, M = 3.41 papers (SD = 1.90), with a minimum of 1 and a maximum of 6 reviewed papers.

Papers as well as reviewers were categorized by two independent raters into one of three disciplines: (a) psychological-experimental, (b) empirical-social, or (c) information-technological (Cohen’s Kappa = .86; disagreements were resolved by discussion). Of all the papers, n_psy.exp = 14 (9.7%) were categorized as psychological-experimental, n_emp.soc = 51 (35.2%) as empirical-social, and n_it = 80 (55.2%) as information-technological. A similar distribution appeared on the reviewers’ side: k_psy.exp = 13 (10.0%) were categorized as psychological-experimental, k_emp.soc = 32 (24.6%) as empirical-social, and k_it = 85 (65.4%) as information-technological. Altogether, n_both = 93 (64.1%) papers were reviewed by both same-discipline and different-discipline reviewers, n_same = 26 (17.9%) papers were reviewed only by same-discipline reviewers, and n_diff = 26 (17.9%) papers were reviewed only by different-discipline reviewers. Again, a similar pattern appeared on the reviewers’ side: k_both = 72 (55.4%) reviewed both same-discipline and different-discipline papers, k_same = 31 (23.8%) reviewed only same-discipline papers, and k_diff = 27 (20.8%) reviewed only different-discipline papers.

Units of analysis

The primary units of analysis were reviews and, on an aggregate level, papers. Review scores and paper scores were calculated for the following five rating dimensions: (a) overall evaluation, (b) relevance to the conference, (c) novelty, (d) significance, and (e) soundness. For each paper and each dimension, the paper score was estimated by averaging the ratings (a) of all reviewers, (b) only of same-discipline reviewers (same-discipline paper scores), and (c) only of different-discipline reviewers (different-discipline paper scores). Thus, the paper score for a given paper on a given dimension was the mean of the ratings over all reviewers who had rated that paper (for the concept of target and rater scores, see, for example, Hönekopp 2006; Hönekopp et al. 2006).
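To illustrate this aggregation step, the following minimal Python/pandas sketch computes paper scores from long-format review data; the data frame, the column names (paper_id, same_discipline, and the dimension columns), and the values are hypothetical and only stand in for the structure described above.

```python
import pandas as pd

# Hypothetical long-format review data: one row per review.
reviews = pd.DataFrame({
    "paper_id":        [1, 1, 1, 2, 2],
    "same_discipline": [True, False, True, False, False],  # reviewer matches paper discipline?
    "overall":         [4, 3, 5, 2, 3],
    "relevance":       [3, 2, 3, 1, 2],
})
dims = ["overall", "relevance"]

# Paper scores: mean rating per paper, (a) over all reviewers,
# (b) over same-discipline reviewers, (c) over different-discipline reviewers.
all_scores  = reviews.groupby("paper_id")[dims].mean()
same_scores = reviews[reviews["same_discipline"]].groupby("paper_id")[dims].mean()
diff_scores = reviews[~reviews["same_discipline"]].groupby("paper_id")[dims].mean()
```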

Measures and variables

A review form guided reviewers in their evaluations. They were asked to provide a detailed review, including a justification for their scores. They were urged to be constructive and to first answer some open-ended questions (see Table 1). More importantly for the purpose of this study, reviewers were then asked to fill out several rating scales (see Table 1).

Table 1 Guide for reviewers’ evaluations

Our analyses focused on the following five variables: overall evaluation, relevance, novelty, significance, and soundness. The values for the overall evaluation ranged from −2 to 2 and were recoded to the range from 1 to 5 by adding 3. In order to analyze only genuine evaluations, we eliminated all ratings that fell into the cannot judge response category for the variables relevance, novelty, significance, and soundness. Thus, for these variables, the analyses are based on recoded 3-point scales (e.g., 1 = not relevant, 2 = some relevance, 3 = highly relevant; see footnote 4). This step introduced missing values and thus an even more complicated data structure. Nevertheless, the analyses we applied were still appropriate for the remaining data.
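As a minimal sketch of this recoding step (the column names and raw values below are hypothetical), the shift of the overall evaluation and the removal of cannot judge responses could look as follows:

```python
import pandas as pd

# Hypothetical raw review data (column names are illustrative only).
raw = pd.DataFrame({
    "overall":   [-2, 0, 2, 1],               # original scale: -2 ... 2
    "relevance": [3, 1, "cannot judge", 2],   # 3-point scale plus "cannot judge"
})

# Overall evaluation: recode from -2..2 to 1..5 by adding 3.
raw["overall"] = raw["overall"] + 3

# Specific dimensions: treat "cannot judge" as missing so that only
# genuine evaluations enter the analyses.
raw["relevance"] = pd.to_numeric(raw["relevance"], errors="coerce")
```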

A further aim of the study was to estimate the relationship between the rating dimensions and the citation rate of the papers published in the subsequent conference proceedings (see footnote 2). Of the n = 145 submissions that had been rated by at least two reviewers, a total of n_ap = 82 submissions were published. We counted the number of citations per paper by searching all databases of the Thomson Reuters Web of Science citation index for each published conference paper over a time frame of roughly three years. Thus, the citation window comprises the first three years after publication (in addition to the remainder of the year in which the conference proceedings had been issued). This time frame is comparable to citation windows used in other studies (e.g., see Bornmann and Daniel 2008a).

Analyses

For estimating the G-coefficients, it was necessary to estimate the following variance components: (a) the paper main effect, (b) the reviewer main effect, and (c) the combination of error variance with paper × reviewer interaction effects. We used IBM SPSS Statistics Version 20.0 (2011) for this task. Variance components were estimated with the restricted (or residual) maximum likelihood (REML) estimator (e.g., O’Neill et al. 2012; Putka et al. 2008; see also Searle et al. 1992). Estimation of G(q, k) also requires estimates of the parameters k and q.
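As a rough computational sketch of how these pieces fit together (the variance components, q, and the reviewer counts below are placeholders rather than the values reported in Online Resources 2 and 3, and q may be estimated separately for the single-rater and k-rater cases), the coefficients can be assembled as follows:

```python
from statistics import harmonic_mean

def g_coefficient(var_paper, var_reviewer, var_residual, q, k):
    """Sketch of G(q, k) as we read Putka et al. (2008): paper variance over
    paper variance plus q-weighted reviewer variance plus residual variance
    (interaction + error) divided by k."""
    return var_paper / (var_paper + q * var_reviewer + var_residual / k)

# Placeholder REML variance components and q (illustrative values only).
var_paper, var_reviewer, var_residual = 0.15, 0.10, 0.55
q = 0.20  # between 0 (fully crossed) and 1/k (fully nested) in an ISMD

# k for the k-rater coefficient: harmonic mean of reviewers per paper.
reviewers_per_paper = [3, 3, 2, 4, 3, 5, 2]   # illustrative counts
k = harmonic_mean(reviewers_per_paper)

single_rater = g_coefficient(var_paper, var_reviewer, var_residual, q, k=1)
k_rater      = g_coefficient(var_paper, var_reviewer, var_residual, q, k=k)
```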

All other analyses were conducted with Mplus 7.3 (Muthén and Muthén 2012). With one exception, models and parameters were estimated with the robust maximum likelihood (MLR) estimator implemented in Mplus (for advantages of using a sandwich estimator, see Muthén and Muthén 2012; White 1980; Yuan and Bentler 2000). For significance testing purposes, several likelihood-ratio tests (LRTs) were also conducted. The reason for this choice was that, especially in small samples, the LRT is superior to the commonly used Wald test (e.g., Enders 2010). It must be noted, however, that LRTs based on the MLR estimator need to be corrected by special scaling factors (www.statmodel.com/chidiff.shtml; see also Enders 2010, p. 149). Missing data were dealt with by the full information maximum likelihood (FIML) method. This method uses all of the available information in the data (e.g., Enders 2001, 2010; Rubin 1976).
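For reference, a minimal sketch of the scaled loglikelihood difference test that such a correction implies (following the procedure described at the URL above; the function and variable names are ours, and the input quantities would be taken from the Mplus output):

```python
def scaled_lrt(ll0, c0, p0, ll1, c1, p1):
    """Sketch of a scaled chi-square difference test for MLR loglikelihoods,
    following the correction described at www.statmodel.com/chidiff.shtml.
    ll0, c0, p0: loglikelihood, scaling correction factor, and number of free
    parameters of the nested (more restrictive) model; ll1, c1, p1: the same
    quantities for the comparison (less restrictive) model."""
    df = p1 - p0                          # difference in free parameters
    cd = (p0 * c0 - p1 * c1) / (p0 - p1)  # difference-test scaling correction
    trd = -2.0 * (ll0 - ll1) / cd         # scaled chi-square difference value
    return trd, df                        # compare trd to chi-square with df
```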

Besides the MLR estimator described above, a confirmatory factor analysis (CFA) was estimated with the Bayes estimator (e.g., see Kaplan and Depaoli 2013; Muthén 2010; van de Schoot et al. 2014; Zyphur and Oswald 2015). Missing data were handled in a similar fashion as with FIML, under the missing at random (MAR) assumption (Asparouhov and Muthén 2010; Enders 2010). For the Bayes estimation procedure, non-informative priors were used. The medians of the posterior distributions served as point estimates of the parameters. The posterior distributions were estimated with the Markov chain Monte Carlo (MCMC) Gibbs sampler algorithm (e.g., see Brown 2015; Kaplan and Depaoli 2013; van de Schoot et al. 2014; see also footnote 5).

Consistent with the Bayesian philosophy, the Bayes estimator does not produce p values but Bayesian credibility intervals (CIs), which are not necessarily symmetric. A 95% CI means that there is “a 95% probability that the population value is within the limits of the interval” (van de Schoot et al. 2014, p. 844). Furthermore, if the CI does not contain the value zero, the corresponding parameter can be interpreted as significant according to classical frequentist null hypothesis testing (van de Schoot et al. 2014).

To address the multiple testing problem (e.g., Shaffer 1995), the raw p values within a given meaningful family of tests (e.g., all coefficients in a correlation matrix) were adjusted. In almost all cases, this was done by the Holm procedure (Holm 1979). For such a test family, the probability of committing at least one type I error, that is, the multiple (familywise) significance level, is restricted to .05. With regard to the Bayesian analyses, the multiple testing problem was addressed by using more conservative 99% CIs instead of 95% CIs.
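A minimal sketch of the Holm step-down adjustment is given below; equivalent adjusted p values could also be obtained with statsmodels' multipletests(..., method='holm').

```python
def holm_adjust(p_values):
    """Holm (1979) step-down adjustment of raw p values.
    Rejecting adjusted p values below .05 keeps the familywise
    error rate of the whole test family at .05."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

print(holm_adjust([0.01, 0.04, 0.03, 0.20]))  # approx. [0.04, 0.09, 0.09, 0.20]
```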

Results

In the following section, we present our results with regard to the key aspects of inter-rater reliability and validity. First, we estimated the inter-rater reliabilities of the five rating dimensions, taking the ill-structured measurement design and the interdisciplinary context into account. Then, we investigated the dimensionality and the construct validity of the rating dimensions. Finally, we examined the predictive criterion validity of the rating dimensions in order to analyze whether the rating dimensions had the potential to predict the citation rate of accepted papers. In each of these steps, we differentiated between ratings that came from same-discipline reviewers and those that came from different-discipline reviewers.

Inter-rater reliability

Here, we primarily focused on the estimation of single-rater reliabilities G(q_k, k = 1) for each dimension. Single-rater reliabilities were chosen, as they are comparable with the coefficients reported in the meta-analysis of Bornmann et al. (2010). Furthermore, for each dimension, we differentiated between paper scores that were based on the ratings of all reviewers and paper scores that resulted as a function of the match/mismatch between the discipline of the paper and the discipline of the reviewer.

Across all reviewers

Table 2 summarizes the data on which the G-coefficient estimations were based. Point estimates for the variance components are listed in Online Resource 2, and estimates for q and k are listed in Online Resource 3. Based on these values, the G-coefficients were calculated by applying the formula described in Putka et al. (2008). The single-rater reliabilities G(q_k, k = 1) based on the ratings of all reviewers are shown in Table 3. As already mentioned, a single-rater reliability coefficient estimates the reliability of a paper score as if this score were based on only a single reviewer’s evaluation.

Table 2 Number of reviews, number of papers, number of different reviewers, and the harmonic mean (HM) of the number of reviewers per paper
Table 3 Estimated single-rater reliabilities G(q, k = 1) and confidence intervals as a function of the class of reviewers

The estimated single-rater reliability of the overall evaluation was .21 and the values for the other rating dimensions ranged from .17 to .28. Confidence intervals were estimated based on the Fisher z-transformation and on the corresponding back-transformation for ICCs (Fisher 1934; see McGraw and Wong 1996; Putka 2002). None of the Holm-adjusted confidence intervals (Holm 1979; see Altman and Bland 2011; Serlin 1993) contained the value of zero. Hence, agreement was significantly above chance. Nevertheless, agreement was low for all dimensions (below .40; e.g., see Cicchetti 1994, p. 286).
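The construction of these intervals can be sketched as follows; the standard error on the z scale is taken as given here, because its exact form depends on the ICC variant at hand (see McGraw and Wong 1996), and for the Holm-adjusted intervals a larger critical value corresponding to the adjusted significance level would replace 1.96.

```python
import math

def fisher_z_ci(r, se_z, z_crit=1.96):
    """Sketch of a confidence interval via the Fisher z-transformation:
    transform the coefficient, build a symmetric interval on the z scale,
    and back-transform the limits to the correlation metric."""
    z = math.atanh(r)                           # Fisher z-transformation
    lower, upper = z - z_crit * se_z, z + z_crit * se_z
    return math.tanh(lower), math.tanh(upper)   # back-transformation
```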

To compare our results for the overall evaluation with those of Baethge et al. (2013; see footnote 1), we collapsed the five response categories into two: (a) weak or strong acceptance versus (b) all categories below weak acceptance (borderline, weak reject, and strong reject). Then, we estimated the AC1 coefficient for multiple raters (Gwet 2008, 2014; see Fleiss 1971), as well as the AC1 coefficient and Cohen’s Kappa as calculated by Baethge et al. (2013) for more than two raters. The corresponding values were .14, .15, and .15, respectively. These chance-corrected estimates underscored the conclusions already drawn from the G-coefficients.

As a function of the match/mismatch between paper and reviewer disciplines

Table 3 also displays the single-rater reliabilities G(q_k, k = 1) based (a) on the ratings of reviewers from the same discipline as the paper and (b) on the ratings of reviewers from different disciplines. The estimated single-rater reliabilities for the overall evaluation were again poor (same-discipline reviewers: .23; different-discipline reviewers: .18). “Fair” agreement (Cicchetti 1994) was found only for some of the single-rater reliabilities based on the ratings from the different-discipline reviewers (.35–.46). In contrast, the agreement for the same-discipline reviewers was rather poor (.13–.20) and even non-significant. In any case, single-rater reliabilities were generally low, regardless of whether reviewers matched the papers’ scientific discipline.

In order to test for significant differences between same-discipline and different-discipline reviews, we estimated confidence intervals for the differences (a) by the method of Donner (1986, p. 76) for independent ICCs and (b) by the method of Ramasundarahettige et al. (2009, p. 1043, pp. 1045–1046) for dependent ICCs (for another approach, see footnote 6). All estimations were based on the Fisher z-transformation and its inverse (Fisher 1934). The confidence intervals for the differences are presented in Table 4. None of the comparisons yielded a significant difference, regardless of whether the two coefficients per rating dimension were defined as independent or as dependent. That is, although all same-discipline single-rater reliabilities were non-significant, and in most cases descriptively smaller than the corresponding different-discipline reliabilities, same- and different-discipline reliabilities did not differ significantly from one another.

Table 4 Two-sided 95% confidence intervals for the differences between single-rater reliabilities from same-discipline versus different discipline reviewers constructed (a) by the method of Donner (1986) for independent ICCs and (b) by the method of Ramasundarahettige et al. (2009) for dependent ICCs

In addition to the single-rater reliabilities, we also estimated the k-rater reliabilities G(q_{k=1}, k = HM), which were based (a) on the ratings from all reviewers, (b) on the ratings from same-discipline reviewers, and (c) on the ratings from different-discipline reviewers (see Fig. 1). Although the overall picture looks similar, the k-rater coefficients were clearly larger than their corresponding single-rater counterparts (see Table 3). This is not surprising, because the phenomenon is analogous to classical test theory, where more items usually result in higher reliability coefficients (e.g., see Yousfi 2005). Based on the ratings of all reviewers, reliability values can be regarded as “fair” (values above .40; Cicchetti 1994) for overall evaluation, relevance, and novelty. With regard to the k-rater reliabilities based on the ratings from different-discipline reviewers, coefficients can be regarded as “fair” for the dimensions relevance, novelty, significance, and soundness. In contrast, reliability values based on the ratings from same-discipline reviewers were poor for all dimensions (see Fig. 1). Finally, there was no evidence for a severity or a leniency bias in either the same-discipline sample or the different-discipline sample of our study (see footnote 7).

Fig. 1 Estimated k-rater reliabilities G(q_{k=1}, k = HM) for the rating dimensions

In sum, the inter-rater reliabilities were generally low. Descriptively, the pattern was contrary to what we expected. That is, the reliabilities of paper scores based on ratings from same-discipline reviewers were descriptively lower than those based on ratings from different-discipline reviewers. These descriptive differences, however, did not reach significance.

Construct validity

We conducted several analyses to investigate the dimensionality and validity of the reviewers’ evaluations, taking scientific discipline into account at the same time. First, we estimated the manifest correlations of the paper scores based on different rater subgroups. Second, we conducted an exploratory factor analysis, which was followed by a more complex confirmatory factor analysis in the CT-C(M-1) framework (Eid 2000; Eid et al. 2003). Then, we examined whether a multidimensional model or a unidimensional model would better fit the data. Finally, we assessed the criterion validity of the rating dimensions for predicting the citation rate of accepted papers.

Manifest correlations

In a first step, we determined the correlations between the paper scores based on the same-discipline reviewers and the paper scores based on the different-discipline reviewers. Such correlations between different paper scores can reveal, for example, whether different rater groups applied the same criteria and came to the same conclusions. Large positive correlations would imply that papers with comparatively high (low) positions in the ranking order based on the ratings given by one rater group also have high (low) positions in the ranking order based on the ratings given by the other rater group (e.g., Henss 1992). In the MTMM approach by Campbell and Fiske (1959, p. 82), such correlations are termed monotrait-heteromethod values and are used to examine convergent validity.

As Table 5 reveals, however, only the correlations for the overall evaluation and for novelty reached significance. For the other three rating dimensions, there were only small and non-significant relationships. It is noteworthy that even the significant correlations were only moderately positive (i.e., of medium effect size, Cohen 1988). In other words, the reviewer groups (same-discipline vs. different-discipline) did not agree (much) on paper quality. This is evidence against convergent validity, under which the same-discipline paper scores and the different-discipline paper scores would have been expected to measure the same construct.

Table 5 Correlations between the paper scores based on the ratings of the same-discipline reviewers and the paper scores based on the ratings of the different-discipline reviewers (monotrait-heteromethod correlations)

A different picture emerges if one looks at the correlations between the different rating dimensions within each reviewer group (heterotrait-monomethod values; see Campbell and Fiske 1959) and across all reviewers. Here, all correlations were positive and significant, and most of them were large (see Table 6). That is, a paper with a high (low) score on one rating dimension also had relatively high (low) scores on the other rating dimensions. This is evidence for rather low discriminant validity. Hence, reviewers (same-discipline and different-discipline alike) did not differentiate much in their assessments of the different paper attributes.

Table 6 Correlations between rating dimensions within both reviewer groups (heterotrait-monomethod correlations) and across all reviewers

Exploratory factor analysis

In a next step, we conducted an exploratory factor analysis (EFA) to explore the factor structure of the paper scores that resulted from same-discipline and different-discipline ratings. The input consisted of 10 variables: (a) the same-discipline paper scores for each of the five rating dimensions and (b) the different-discipline paper scores for each of the five rating dimensions (see Table 7). Papers (n = 145) were the units of analysis. The EFA was estimated by MLR with the full information maximum likelihood (FIML) method. The Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy of the correlation matrix, which should not be smaller than .50, reached a value of .77 (Dziuban and Shirkey 1974; Kaiser 1970; Kaiser and Rice 1974; see also Field 2009; Hutcheson and Sofroniou 1999). The measures of sampling adequacy (MSAs) for each single variable were also greater than .50 (see Table 7).

Table 7 Measure of sampling adequacy (MSA), communality values (COM), and factor loadings for an exploratory factor analysis (EFA)

The number of factors was determined by the scree plot of the eigenvalues (based on the non-reduced correlation matrix: 3.90, 2.46, 0.77, 0.71, 0.57, 0.54, 0.37, 0.29, 0.20, 0.18), which is shown in Online Resource 5 (Cattell and Jaspers 1967; but see also Cattell 1966). Applying this criterion, the EFA suggests a two-factor structure. This decision was clearly supported by a more accurate parallel analysis (Horn 1965), using the criterion that only those eigenvalues should be retained that lie above the 95th percentile of randomly generated eigenvalues (see Hayton et al. 2004; 1000 random data sets were generated; see Online Resource 5). After the extraction of two factors, the oblique Geomin rotation (Yates 1987) was applied. The two factors correlated significantly at .29 (p = .003).
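A minimal sketch of this parallel-analysis criterion is given below; it uses complete random normal data of the same dimensions as our sample, which simplifies the FIML-based estimation actually used, and the retained number of factors is expected, not guaranteed, to be two for the eigenvalues reported above.

```python
import numpy as np

def parallel_analysis_threshold(n_obs, n_vars, n_sets=1000, percentile=95, seed=1):
    """Sketch of Horn's (1965) parallel analysis: eigenvalues of correlation
    matrices of random normal data, summarized by an upper percentile.
    Observed eigenvalues above this threshold indicate factors to retain."""
    rng = np.random.default_rng(seed)
    eigenvalues = np.empty((n_sets, n_vars))
    for s in range(n_sets):
        random_data = rng.standard_normal((n_obs, n_vars))
        corr = np.corrcoef(random_data, rowvar=False)
        eigenvalues[s] = np.sort(np.linalg.eigvalsh(corr))[::-1]
    return np.percentile(eigenvalues, percentile, axis=0)

observed = np.array([3.90, 2.46, 0.77, 0.71, 0.57, 0.54, 0.37, 0.29, 0.20, 0.18])
threshold = parallel_analysis_threshold(n_obs=145, n_vars=10)
n_factors = int(np.sum(observed > threshold))  # expected to be 2 here
```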

As Table 7 shows, all of the same-discipline indicators only loaded significantly on factor F-I and all of the different-discipline indicators only loaded significantly on factor F-II. Therefore, it seems that this two-factor solution was an example of artificial method factors which, for instance, are typical of applications where positively and negatively formulated (reversed) items create their own factors (e.g., Brown 2015).

Again, these results provide evidence against the convergent validity of the paper scores from both reviewer groups, as well as against the discriminant validity of the five dimensions themselves. After all, the exploratory factor analysis did not reveal rating dimensions as distinct factors, as would have been expected if the rating dimensions had had a high convergent as well as discriminant construct validity. It is also notable that the overall evaluation indicators had the highest loadings in comparison to the other rating dimensions (see Table 7), which makes the overall evaluation a marker variable that is most essential for the interpretation of each factor.

Confirmatory factor analysis

In a next step, we tested the dimensionality and method specificity of the ratings in a more elaborate way with a special kind of confirmatory factor analysis (CFA): a correlated trait–correlated (method minus one) [CT-C(M-1)] model (Eid 2000; Eid et al. 2003). The corresponding model structure is illustrated in Fig. 2.

Fig. 2 A correlated trait–correlated (method minus one) [CT-C(M-1)] model (Eid 2000; Eid et al. 2003) in which the same-discipline paper scores serve as reference method

Here, each rating dimension was given an exclusive factor with two indicators: (a) the paper scores that were estimated by averaging the ratings of same-discipline reviewers and (b) the paper scores that were estimated by averaging the ratings of different-discipline reviewers. The usage of two aggregated indicators per factor is comparable to the approach of using item parcels (e.g., test halves) as manifest indicators (e.g., see Brown 2015).

Additionally, an asymmetric method factor (MF) was specified, as proposed by Eid (2000). The MF was uncorrelated with the other factors, and only indicators from one method loaded on it. The method without loadings on the MF had to be interpreted as the reference method. That is, one method took on the reference role and acted as a comparison standard, whereby the method specificity of the other method was captured by the MF. Because such a model is therefore asymmetric and because there was no natural standard method, two models were estimated in which the roles of the same-discipline and the different-discipline scores were reversed. For scaling the latent variables and for identification purposes, one unstandardized loading per factor was fixed to the value of one (see Fig. 2).
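In measurement-equation form, and with the same-discipline scores as the reference method, the model can be sketched as follows; this is our simplified notation following Eid (2000), with intercepts omitted.

```latex
% Sketch of the CT-C(M-1) measurement equations with the same-discipline
% scores as reference method (simplified notation following Eid 2000)
\begin{align*}
  Y^{\text{same}}_{j} &= \lambda^{\text{same}}_{j}\, T_{j} + \varepsilon^{\text{same}}_{j} \\
  Y^{\text{diff}}_{j} &= \lambda^{\text{diff}}_{j}\, T_{j} + \gamma_{j}\, M + \varepsilon^{\text{diff}}_{j}
\end{align*}
% T_j: factor of rating dimension j; M: method factor of the non-reference
% (here: different-discipline) method, uncorrelated with the T_j; one loading
% per factor is fixed to 1 for identification.
```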

However, such models can be prone to convergence problems (Eid 2000; Eid et al. 2003, p. 60). Therefore, and because of the relatively small sample size, the Bayes estimator was used. This estimator also has the advantage that implausible values (e.g., negative variances) are impossible (van de Schoot et al. 2014; Zyphur and Oswald 2015). Accordingly, the two models were estimated with the Bayes estimator by Mplus (Muthén and Muthén 2012).

The posterior predictive p value (PPP) was low for both models: .132 (same-discipline reviews as reference) and .191 (different-discipline as reference). PPPs should be around .50 and not very much smaller for models with an excellent fit (Muthén and Asparouhov 2011). However, the PPP should be regarded as a fit index and not as a statistical test. Moreover, the two models were already liberal and could be changed only with fit-reducing restrictions (e.g., equality constraints). Hence, the results of both models should not be ignored completely, but should be interpreted with great caution.

The estimated latent factor variances for both models are shown in Online Resource 6. For each model, both the variances of the rating dimension factors and the variance of the method factor were significant. The significant method factor variance supports the inclusion of the method factor in our model.

Table 8 shows the unstandardized factor loadings for both models (for the corresponding credibility intervals, see Online Resource 7). Interestingly, the indicators with freely estimated loadings had significant loadings only on the method factor (if such paths were allowed); in neither case were their loadings on the rating dimension factors significant. Thus, the method factor seemed to be more important for an indicator variable than the corresponding rating dimension factor.

Table 8 Unstandardized factor loadings for a confirmatory factor analysis (CFA) correlated trait–correlated (method minus one) [CT-C(M-1)] model

This impression was reinforced by examining the coefficients of the CT-C(M-1) framework (Eid 2000; Eid et al. 2003). The reliability coefficient of an indicator variable provides information about the proportion of variance that is not attributable to random measurement error. For indicators with additional loadings on a method factor, the reliability coefficient can be decomposed into a consistency coefficient and a method-specificity coefficient, which sum to the reliability. The consistency coefficient provides information about the measurement-error-free proportion of variance that is determined by the comparison-standard component. The method specificity estimates the measurement-error-free proportion of variance that is method-specific and therefore not shared with the comparison standard (for detailed formulas, see Eid 2000; Eid et al. 2003). Both the consistency coefficient and the method-specificity coefficient can be defined (a) with the observed (manifest) variance as divisor or (b) with the measurement-error-free (true-score) variance as divisor. In the latter case, the two coefficients add up to one.
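For a non-reference indicator Y with trait loading lambda and method loading gamma, these coefficients (in their observed-variance version) can be sketched as follows; this is our paraphrase of the formulas in Eid (2000) and Eid et al. (2003).

```latex
% Sketch of the variance decomposition in the CT-C(M-1) framework
% (our paraphrase of Eid 2000; Eid et al. 2003)
\begin{align*}
  \mathrm{CON}(Y) &= \frac{\lambda^{2}\,\mathrm{Var}(T)}{\mathrm{Var}(Y)}, &
  \mathrm{MS}(Y)  &= \frac{\gamma^{2}\,\mathrm{Var}(M)}{\mathrm{Var}(Y)}, &
  \mathrm{Rel}(Y) &= \mathrm{CON}(Y) + \mathrm{MS}(Y).
\end{align*}
% If the true-score variance Var(Y) - Var(epsilon) is used as divisor instead
% of Var(Y), consistency and method specificity add up to one.
```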

All estimated coefficients for the model in which the same-discipline indicators act as a reference method as well as for the model in which the different-discipline indicators act as a reference method are listed in Table 9. As can be seen, particularly in the last two columns, it was nearly exclusively the method specificity which accounted for the measurement-error-free variance. As in the EFA, the relationships among the variables measured with the same method, that is, within each reviewer group, dominated the scenario. The fact that the method-specificity coefficients were much larger than the corresponding consistency coefficients indicates that the convergent validity was very low in all cases (Eid 2000; Eid et al. 2003). In other words, as found in the exploratory factor analysis, same-discipline reviewers and different-discipline reviewers did not agree in their evaluations.

Table 9 Coefficients for the confirmatory factor analysis (CFA) correlated trait–correlated method minus one [CT-C(M-1)] model

Likewise, the discriminant validity of all constructs is also in doubt. Although the latent (measurement-error-free) correlations (see Table 10) hardly ever reached the critical value of |.80| (or |.85|) that would indicate poor discriminant validity (e.g., Brown 2015, p. 28), it is obvious that the discriminant validity, in general, should be regarded as rather low. This corroborates the conclusion we reached through the inspection of the manifest correlations (see Table 6).

Table 10 Factor intercorrelations from the confirmatory factor analysis (CFA) correlated trait–correlated method minus one [CT-C(M-1)] model

However, even though the latent inter-correlations were relatively high, it must be taken into account that they were based on measurement-error-free variables. Thus, given the fact that latent correlations were mostly below .80, the results indicated that the rating dimensions could be empirically separated. That is, the reviewers’ evaluations of the different rating dimensions were quite similar but not identical. In other words, when several reviewers, for example, rated the novelty of a submission to be high (low), these same reviewers were also likely to rate the soundness of the same submission as rather high (low).

The conclusion of a low but non-negligible discriminant validity was also supported by model comparisons between a unidimensional CT-C(M-1) model and a multidimensional CT-C(M-1) model for the dimensions relevance, novelty, significance, and soundness. Hence, a 1-factor model was compared to a 4-factor model. The deviance information criterion (DIC; Spiegelhalter et al. 2002) was used as the comparison criterion, whereby the model with the smaller DIC should be preferred (e.g., see Kaplan and Depaoli 2013). Models were estimated without considering the overall evaluations. Table 11 shows the DIC values both for (a) models in which the same-discipline scores acted as the reference method and for (b) models in which the different-discipline scores acted as the reference method. In both cases, the DIC for the multidimensional model was smaller than the DIC for the unidimensional model. Thus, the supposedly distinct dimensions relevance, novelty, significance, and soundness seem to be slightly better represented by a multidimensional model than by a unidimensional model.
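For reference, the DIC balances model fit against effective model complexity (Spiegelhalter et al. 2002):

```latex
% Deviance information criterion (Spiegelhalter et al. 2002)
\[
  \mathrm{DIC} = \bar{D} + p_{D},
  \qquad
  p_{D} = \bar{D} - D(\bar{\theta}),
\]
% \bar{D}: posterior mean of the deviance; D(\bar{\theta}): deviance at the
% posterior means of the parameters; p_D: effective number of parameters.
% The model with the smaller DIC is preferred.
```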

Table 11 Relevance, novelty, significance, and soundness: comparison between unidimensional and multidimensional models

Predictive criterion validity: Citation rate prediction

The distinctiveness of the rating dimensions is also relevant for the final analysis, namely the examination of the criterion validity of the rating dimensions. For this purpose, we analyzed whether the accepted papers’ citation rate was predictable from the rating dimensions.

From the n = 145 submissions which had been rated by at least two reviewers, a total of n_ap = 82 accepted submissions were published in the conference proceedings (see footnote 2). Altogether, we analyzed citations in a window of about three years. The mean citation rate was M = 1.24 (SD = 2.02). The lowest citation rate was 0 (n_c0 = 40; 48.8%). The highest citation rate was 11 (n_c11 = 1; 1.2%).

The citation rate, as a genuine count variable, was regressed on the five rating dimensions by applying negative binomial regression models (Hilbe 2011). Parameters were estimated by MLR with Monte Carlo integration (Muthén and Muthén 2012). All analyses were conducted for paper scores (a) based on the ratings from all reviewers, (b) based on the ratings from same-discipline reviewers, and (c) based on the ratings from different-discipline reviewers. Table 12 summarizes the results of these negative binomial regressions.
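A minimal sketch of such a negative binomial regression in Python/statsmodels is given below; it is a stand-in for the Mplus analysis reported here, and the data frame, column names, and values are purely illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import NegativeBinomial

# Hypothetical published papers: citation counts plus two paper scores
# (column names and values are illustrative only).
papers = pd.DataFrame({
    "citations": [0, 0, 2, 1, 5, 0, 3, 11, 1, 2],
    "relevance": [2.0, 1.5, 2.5, 2.0, 3.0, 1.5, 2.5, 3.0, 2.0, 2.5],
    "novelty":   [2.5, 2.0, 2.0, 2.5, 1.5, 3.0, 2.0, 1.5, 2.5, 2.0],
})

predictors = ["relevance", "novelty"]
X = sm.add_constant(papers[predictors])            # add intercept column
result = NegativeBinomial(papers["citations"], X).fit(disp=False)

# exp(slope) is the rate ratio: the multiplicative change in the expected
# citation rate for a one-unit increase in a predictor, others held constant.
rate_ratios = np.exp(result.params[predictors])
```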

Table 12 Negative binomial regressions (NBRs) of citation rate on the five rating dimensions (with a citation window of roughly three years)

Based on the adjusted p values, it can be stated that significant partial effects on citation rate appeared only for the two rating dimensions relevance and novelty, and then only if the scores were based on the ratings from same-discipline reviewers. In addition, a significant effect of relevance also appeared when the paper scores based on the ratings of all reviewers were considered. For a more detailed interpretation, the rate ratio coefficients (from a multiplicative model) in Table 12 should be inspected. These coefficients were obtained by exponentiation of the slopes (see Hilbe 2011).

The multiplicative rate ratio coefficient of 4.24 for relevance in Table 12 (same-discipline paper scores) means that, with all other predictors held constant, a one-unit increase in relevance multiplied the expected citation rate by 4.24. In other words, a one-unit increase in the relevance ratings made by same-discipline reviewers would result in 324% more citations. Conversely, the rate ratio coefficient of 0.28 for novelty (same-discipline paper scores) means that, with all other predictors held constant, a one-unit increase in novelty multiplied the expected citation rate by 0.28. That is, a one-unit increase in the novelty ratings made by same-discipline reviewers would result in 72% fewer citations. Table 12 also shows that several slopes in the regressions with only one predictor were significant. This different result pattern could be an effect of the high correlations among the predictors.

In sum, our results demonstrated the predictive power of reviewers’ relevance and novelty ratings, provided that each reviewer belonged to the same discipline as the paper. These effects emerged even when the citation window was lengthened to a time span of seven years (see Online Resource 8).

Discussion

The study set out to investigate two facets of the quality of reviews in an interdisciplinary context: inter-rater reliability and validity. Taken together, our findings draw a somewhat pessimistic and, to some extent, mixed picture. Not only did we find little agreement among reviewers, our findings also argue for little convergent as well as discriminant construct validity. However, with regard to predictive criterion validity, we found that the ratings for relevance and novelty were capable of predicting the citation rate of the accepted papers which were published in the conference proceedings. These effects were restricted to ratings made by reviewers from the same discipline as the papers. Let us address each aspect in detail.

Poor agreement among reviewers

Across all reviewers (same-discipline and different-discipline) the agreement among reviewers was above chance, yet rather poor (Cicchetti 1994). This is a common finding, as indicated by the most recent meta-analysis (Bornmann et al. 2010; see also Cicchetti 1991). It seems noteworthy, however, that the average agreement based on all reviewers and across all five rating dimensions (.22) was even below the mean of the ICCs and r² coefficients obtained in the meta-analysis (.34).

It would seem plausible to suspect that the interdisciplinary context accounts for these findings, since most prior studies on inter-rater reliability were conducted within single scientific disciplines; one could therefore expect shared standards to enhance agreement. Our results do not confirm this argument, however. A differentiation between intra- and interdisciplinary reviews even indicates the opposite. Whereas agreement was poor and not even distinguishable from chance for intra-disciplinary reviews (average single-rater reliability: .16), it was higher and well above chance for interdisciplinary reviews (average single-rater reliability: .35). This is interesting considering that the discipline of an “outsider” may differ not only from the paper’s discipline but also from other reviewers’ disciplines (e.g., when a paper from the psychological-experimental category is rated by one reviewer with an information-technological background and by another reviewer with an empirical-social background). To be sure, “higher” agreement in this case meant “fair” instead of “poor” agreement, not “good” or even “excellent” agreement (Cicchetti 1994).

It must also be acknowledged that the differences between interdisciplinary and intra-disciplinary reviews were not statistically significant. Therefore, the descriptively higher agreement in interdisciplinary reviews should not be overemphasized. It does make clear, however, that the overall poor agreement obtained in our study cannot be attributed to the interdisciplinary context. At the same time, these results call into question the implicit assumption that intra-disciplinary inter-rater agreement is, a priori, higher than interdisciplinary inter-rater agreement. It is still far too early to draw firm conclusions: our study was the first to examine agreement among reviewers in an interdisciplinary context, so further studies are needed. For such studies, we strongly recommend appropriate statistics, as ill-structured designs seem to be the norm rather than the exception in peer-review data sets (e.g., when analyzing submissions to scientific journals or meetings).

Even more importantly, however, more agreement is needed. By now, several researchers have criticized peer ratings as they “fall short of acceptable standards of reliability” (e.g., Marsh et al. 2007, p. 37). Of course, agreement is relevant and desirable only if one assumes that manuscripts possess an inherent objective quality (Kirk and Franke 1997). From the standpoint of rejecting this idea (e.g., Luce 1993) the very notion of quality control becomes irrelevant, and consequently peer review would not be needed to serve a gatekeeping function. As long as decisions for or against the acceptance of a submitted manuscript or grant proposal are based on peer reviews, however, “appreciable levels of agreement and a principled, valid basis for agreement” are necessary (Whitehurst 1983, p. 78; Burdock et al. 1963). After all, these decisions have an impact on the career of researchers (e.g., van Dalen and Henkens 2012). Therefore, research should not only tackle the status quo of interrater agreement but also how to improve it.

Poor construct validity of the reviewers’ evaluations

The findings of our various analyses regarding construct validity of the ratings draw a very coherent picture. On the one hand, they show evidence for poor convergent validity. First, manifest monotrait-heteromethod correlations of the same dimensions between different groups of raters were low. Second, the exploratory factor analysis yielded two factors instead of five. The pattern of the factor loadings clearly revealed that these two factors were not based on content but rather represented two method factors (one for the same-discipline reviews and one for the different-discipline reviews). Third, the confirmatory factor analysis also pointed to the differentiation between same-discipline and different-discipline reviews and revealed that the corresponding indicators did not significantly load on common rating dimension factors. Together with a large method factor variance and a high method-specificity coefficient, this indicates a poor convergent validity.

At the same time, our findings also suggest rather poor discriminant validity. First, manifest correlations between different rating dimensions within each reviewer group (heterotrait-monomethod correlations) and across all reviewers were rather high (even if below .80). Second, the exploratory factor analysis yielded high loadings within the factors of each reviewer group. Third, the confirmatory factor analysis yielded high latent correlations between factors. However, most latent correlations were not as high as would have been expected if reviewers’ evaluations had been unidimensional (e.g., above .80). Similarly, our comparison between a unidimensional and a multidimensional model suggested that the different rating dimensions were closely related, but not closely enough to support a unidimensional model. Rather, the multidimensional model yielded the better fit. In other words, despite the fact that we found only modest empirical support for the distinctiveness of the rating dimensions, it might still make sense to ask reviewers to take these different dimensions into account when evaluating a submission.
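
The manifest multitrait-multimethod comparison underlying these points can be expressed in a few lines. The sketch below, which reuses the hypothetical paper-score layout from the previous example, contrasts the average monotrait-heteromethod correlation with the average heterotrait-monomethod correlation; the file and column names are placeholders.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical paper-score matrix with columns "sd_<dimension>" and
# "dd_<dimension>" for five rating dimensions (placeholder file name).
scores = pd.read_csv("paper_scores.csv")
dims = ["relevance", "novelty", "rigor", "clarity", "overall"]
corr = scores.corr()

# Monotrait-heteromethod: same dimension, different reviewer group (convergent validity).
mono_hetero = [corr.loc[f"sd_{d}", f"dd_{d}"] for d in dims]

# Heterotrait-monomethod: different dimensions within one reviewer group (discriminant validity).
hetero_mono = [corr.loc[f"{g}_{a}", f"{g}_{b}"]
               for g in ("sd", "dd")
               for a, b in combinations(dims, 2)]

print("mean monotrait-heteromethod r:", round(np.mean(mono_hetero), 2))
print("mean heterotrait-monomethod r:", round(np.mean(hetero_mono), 2))
# Convergent validity requires the first mean to be high; discriminant validity
# requires it to clearly exceed the second. The pattern described above is the reverse.
```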

Obviously, the rating dimensions were similar but not redundant. Consequently, they still provided more information than a single global rating would have, yet they hardly reflected independently assessed dimensions. Our results are therefore similar to those of Marsh and Ball (1989), who concluded that “although there was support for the conceptual and empirical distinctiveness of four components, there was little support for their practical utility” (pp. 165–166).

Relevance and novelty as predictors for the citation rate

With regard to the prediction of the citation rate, by contrast, our findings do point to a practical utility of distinct rating dimensions. That is, they demonstrate the predictive criterion validity of relevance and novelty ratings made by reviewers from the same discipline as the paper.

With regard to highly cited papers, it seems clear that “in order to get highly cited the content of the highly cited paper must be useful or of relevance for the research activity” (Aksnes 2003, p. 167). Accordingly, papers rated as highly relevant, at least in the eyes of same-discipline reviewers, were cited more frequently than papers which received low relevance ratings. This finding is not trivial, as it indicates that same-discipline reviewers are in fact capable of evaluating the relevance of submissions for their scientific community. This does not mean, however, that attributed relevance automatically indicates an objective quality that is later reflected in high resonance within the reviewers’ scientific community. Rather, it means that reviewers seem to have a good sense of which submissions will generate that kind of resonance. This resonance takes, among other forms, the form of countable citations, a kind of scientific currency. The implications, shortcomings, and benefits of the resonance metaphor cannot be explored in depth here, however (for a positively connoted socio-psychological concept of resonance, see Rosa 2016; for a comparison of theoretical approaches to citation behavior, see Bornmann and Daniel 2008c).

The observable success or impact of highly relevant papers in terms of citations does not necessarily imply that highly innovative papers in particular generate a lot of resonance. Quite the contrary: another important finding was that papers with high (low) novelty ratings from same-discipline reviewers were cited to a lesser (greater) degree. This is in line with the findings of Stephan et al. (2017), who concluded that “more-novel papers were more likely to be either a big hit or ignored compared with non-novel papers in the same field” and that “novelty needs time” (p. 412; for some remarks on the conservative bias, see, for example, Benda and Engels 2011; Lee et al. 2013). It must also be kept in mind, however, that novelty by itself does not guarantee scientific quality.
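
The direction of these two effects can be illustrated with a simple count-data regression of citation counts on the same-discipline relevance and novelty scores. The sketch below uses a negative binomial GLM because citation counts are typically overdispersed; the input file, variable names, and model family are assumptions made for illustration, not the analysis reported above.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file containing aggregated paper scores and citation counts.
data = pd.read_csv("paper_scores.csv")

X = sm.add_constant(data[["sd_relevance", "sd_novelty"]])
y = data["citations"]

# Negative binomial handles the overdispersion typical of citation counts.
model = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
print(model.summary())
# A positive coefficient for sd_relevance and a negative one for sd_novelty
# would mirror the pattern described above.
```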

Limitations

One criticism might be that the ratings per paper should not be aggregated at all because of the rather low k-rater reliabilities. The poor reliability for single reviewers and for both reviewer groups might also explain the appearance of the method factors that we observed in the exploratory factor analyses: agreement may simply have been too low for distinct content factors to emerge. In this case, however, the conclusion would be even more negative, as we could not speak of low construct validity but would have to concede that the results of a validity analysis are difficult to interpret at all. The implication would be all the more obvious: we urgently need better agreement in peer reviews. This resembles the insight from classical test theory that reliability (more precisely, the reliability index) restricts the possible upper limit for validity (e.g., Raykov and Marcoulides 2011, pp. 193–194). Here again, it is all the more important not only to investigate the peer-review system but also to try to improve it.
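
In classical test theory, the correlation between an observed score and any criterion cannot exceed the reliability index, that is, the square root of the reliability. As a purely illustrative calculation, if the reliability of the scores entering a validity analysis were as low as the interdisciplinary single-rater value of .35 reported above (an assumption made only for the sake of the example), then

\[ r_{XY} \le \sqrt{r_{XX'}} = \sqrt{.35} \approx .59, \]

so even a perfectly valid underlying judgment could not correlate with any criterion by more than about .59.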

Another limitation concerns the nature of our data, and of data from review processes in general. Except for the estimations of inter-rater reliability, almost all analyses were based on paper scores, just as analyses in psychological research are usually based on participant scores (reviewers and items, respectively, serve as the measurement instruments in the two scenarios). These paper scores were constructed by averaging the ratings (a) of all raters, (b) of same-discipline raters, and (c) of different-discipline raters (see the sketch following this paragraph). Thus, these variables can be regarded as quasi-continuous. With respect to the non-aggregated ratings, the question of whether such single-item rating scales can be treated as continuous variables in the analysis is still heavily debated between “purists” and “pragmatists” (Bortz and Döring 2006, p. 181). The pragmatic strategy seems justified for new research questions and for cases in which important and consistent result patterns are obtained and later replicated with more sophisticated methods (Bortz and Döring 2006, p. 182; see also, for example, Hassebrauck 1993; Rhemtulla et al. 2012). For example, single-item rating scales have been successfully applied to collect ratings of the physical attractiveness of target persons (e.g., Hassebrauck 1983; Hönekopp 2006), to measure self-rated political ideology (e.g., Cohrs et al. 2005), and to collect impressions of the persuasiveness of presented material (e.g., Lord et al. 1979).
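
As noted above, the construction of the paper scores amounts to a simple grouped aggregation over reviewers. The following minimal sketch assumes a hypothetical long-format table; all file and column names are placeholders.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (paper, reviewer, dimension)
# with a flag indicating whether the reviewer shares the paper's discipline.
ratings = pd.read_csv("ratings_long.csv")   # columns: paper, reviewer, same_discipline, dimension, rating

# (a) paper scores averaged over all raters
all_scores = ratings.pivot_table(index="paper", columns="dimension",
                                 values="rating", aggfunc="mean")

# (b)/(c) paper scores averaged within the same- and different-discipline groups
group_scores = ratings.pivot_table(index="paper",
                                   columns=["same_discipline", "dimension"],
                                   values="rating", aggfunc="mean")
```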

A further limitation is that we used only Web of Science databases to count citation rates. However, for the time span of our investigation (see footnote 2), Web of Science can be seen as an appropriate interdisciplinary citation index that ensured a comparatively high standard of accuracy when searching for citations in scientific writings (e.g., de Winter et al. 2014; see also Baethge et al. 2013). A methodological advantage was that all of the papers we analyzed were published at the same time, so no offset correction for different time spans was necessary.

Implications and outlook

Despite the limitations of our study, we regard the results as worrisome. While the consequences of low inter-rater reliabilities may be limited with regard to conference submissions, they are more serious when it comes to journal submissions and even more severe in the case of grants, which ultimately shape careers. Thus, it seems time not only to examine the reliability and validity of the peer-review system but also to think of better ways to assess submissions. Several suggestions have been put forward in the recent past.

Critical appraisal tools (CATs; for an overview see Crowe and Sheppard 2011a) could be important in this regard. CATs are standardized instruments for the evaluation of scientific documents and were designed mainly as tools for systematic reviews. They allow for a thorough evaluation of research articles and enable identification of the best articles on a given topic (e.g., Crowe and Sheppard 2011a). However, evidence for the reliability and validity of such tools is sparse (Crowe and Sheppard 2011a, b).

Another obvious starting point might be to improve inter-rater reliability by training reviewers (Oxman et al. 1991). Such training, however, requires a shared understanding both of the criteria for paper quality and of when and to what extent those criteria are met. Hence, journal editors, conference organizers, and grant providers would need to agree on these issues in order to provide detailed instructions for reviewers. Formal training might, however, be less effective than one would hope (Callaham and Tercier 2007; see also Houry et al. 2012). Moreover, some reviewers might feel that their academic freedom is threatened by training (e.g., Adams 1991).

A further possibility for enhancing agreement is to avoid reviewers who have been nominated by the author (Marsh et al. 2007). The number of reviewers could also be increased in order to obtain more reliable results (Wood et al. 2004). Since adding reviewers would substantially increase the effort involved, this particular solution might be appropriate mainly for cases with more serious consequences (e.g., grants), where it would reduce the impact of chance (Cole et al. 1981; but see also List 2017).
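
How quickly the reliability of an averaged judgment grows with additional reviewers can be approximated with the Spearman–Brown formula, under the assumption of roughly parallel raters. The single-rater reliability of .35 used below is the interdisciplinary value reported above; the calculation is purely illustrative.

```python
def spearman_brown(single_rater_reliability: float, k: int) -> float:
    """Reliability of the mean of k parallel ratings (Spearman-Brown formula)."""
    r = single_rater_reliability
    return k * r / (1 + (k - 1) * r)

for k in (2, 3, 6):
    print(k, round(spearman_brown(0.35, k), 2))
# With a single-rater reliability of .35, about six reviewers would be needed
# to push the reliability of the averaged judgment above .75, which illustrates
# the effort this remedy entails.
```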

Another approach to increasing reliability might be the “reader system” suggested by Jayasinghe et al. (2006). Its most important feature is that the same three to four experts of a sub-discipline review all proposals and are asked to rank them. This procedure establishes a shared frame of reference and eliminates rater effects (leniency/harshness). The reader system was developed with grant proposals in mind, all of which are submitted at the same time. Applying it to journals, which receive submissions on a continuous basis, would require some adaptations, and it might not be feasible at all when many submissions have to be evaluated.

Finally, open peer review has been suggested (Groves 2010). In this system, reviewers would not remain anonymous but would sign their reviews. Initial evidence suggests that open peer reviews are of higher quality (Walsh et al. 2000). We are not aware of any study which examines reliability or validity issues in open peer review, let alone one which provides a comparison to traditional (closed) peer review (but see DeCoursey 2006 and Khan 2010 for a discussion of advantages and disadvantages of open peer review).

Apart from considering variations of the traditional peer-review system, one might also consider alternatives to it. Low agreement is less of a problem, for instance, if peer review does not fulfill a gatekeeping function (for publication). Imagine a scenario in which everything that passed an initial screening was published (Wood et al. 2004). It would then essentially be up to the entire scientific community to deal with the publication, and how well it was received would still be measurable by citations. Given prior evidence that impact in the scientific community is only loosely linked to reviewers’ evaluations (Akerlof 2003; Gottfredson 1978; Harrison 2004), such a scenario might be particularly interesting. Moreover, it could be combined with novel elements that have become possible with Web 2.0 and social media, such as reader evaluations or post-publication peer review (but see Anderson 2012). Of course, we do not yet know whether such a system would be superior to the traditional peer-review system (Smith 2003). As long as there are no empirical comparisons, however, traditional peer review may simply survive for lack of good alternatives; this is a weak justification for retaining it.

With regard to validity, our results indicated rather poor convergent validity, although the model comparison did favor a multidimensional model with distinct rating dimensions. It therefore seems premature to conclude that distinct rating dimensions are unnecessary. This is especially true for the dimensions relevance and novelty, which were predictive of the citation rate provided the ratings were made by reviewers from the same discipline as the papers. Future studies should therefore investigate the predictive criterion validity of different rating dimensions with regard to a broad range of criterion variables. Such criterion variables could be operationalized, for example, with social network analysis (SNA) methods; they could indicate how central an article is in a citation network and to what degree an article has linked different disciplines (e.g., Halatchliyski and Cress 2014; see the sketch at the end of this section). In addition, a simple but perhaps effective method for predicting citation and download rates could be to ask reviewers directly to assess a paper’s potential for generating citations and clicks. A legitimate question, however, is whether the citation rate is actually an adequate proxy for scientific quality (e.g., Bornmann and Daniel 2008c; Lindsey 1989; Stephan et al. 2017; Tahamtan et al. 2016). But that is another story, and one worthy of more attention in future research.
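
As a sketch of how such network-based criterion variables might be computed, the following code builds a small hypothetical citation network with networkx and derives two centrality measures: PageRank as a rough indicator of how central a paper is in the citation network, and betweenness centrality as a rough proxy for bridging otherwise weakly connected parts of it. All nodes and edges are placeholders.

```python
import networkx as nx

# Hypothetical directed citation network: an edge (a, b) means paper a cites paper b.
citations = [("p4", "p1"), ("p3", "p1"), ("p4", "p2"), ("p5", "p3"), ("p5", "p1")]
G = nx.DiGraph(citations)

# How central is each paper within the citation network?
pagerank = nx.pagerank(G)
betweenness = nx.betweenness_centrality(G)

for paper in G.nodes:
    print(paper, round(pagerank[paper], 3), round(betweenness[paper], 3))
```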