Modeling Peer Review

Research has long questioned the validity and reliability of peer review, the process for selecting manuscripts for publication and research proposals for funding. For example, scholars have shown that reviewers do not interpret evaluation criteria in the same way [1] and produce inconsistent ratings (Bornmann et al. [2]), and that peer review is subject to gender, ethnicity, seniority, and reputation biases [9, 12].

Social simulation, and agent-based models in particular, have proven to be valuable tools for studying the causes of, and seeking remedies for, the issues with peer review [16]. This is due to three factors: (1) the complex nature of peer review systems, which are characterized by non-linear interdependencies between applicants, reviewers, and the institutions in which they are embedded; (2) the typically high cost and risk of testing interventions; and (3) the notorious scarcity of available data on peer review systems [8, 15].

Since 1969, when scholars first turned to formal and computational modeling to study peer review [17], 44 modeling papers have been published on the subject. The current state of the art shows two lacunae: limited model integration and limited empirical calibration and validation [4].

This paper reports on work in progress aimed at filling these gaps. In the context of a larger, mixed-method project on the peer review process at Science Foundation Ireland (SFI), we are integrating and building on existing simulation models of peer review to compare them and better connect them to empirical reality. In our work we focus on several aspects of peer review, one of which is aggregation rules. These rules define how the assessments by different reviewers and/or on different evaluation criteria are combined into an aggregated score, a number that captures the overall worth of a submission. We will present our ongoing work on aggregation rules as an example to illustrate the typical lacunae in simulation studies and how a mixed-methods approach can help address them.

Aggregation Rules in Simulation Literature

Several simulation studies have explicitly modeled aggregation rules. A complete review is provided in Feliciani et al. [4]; here we mention a few examples. In some models the aggregated score can be the median of the different review scores [11]; some other models take the mean of the scores—the mean can be weighted by the reviewers’ reputation [13] or complemented with information on the standard deviation of the individual scores [10].
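To make these alternatives concrete, the following minimal Python sketch expresses each rule as a function. The function names, the reputation-weighting scheme, and the dispersion variant are our own illustrative assumptions, not the exact formulations used in [10, 11, 13].

```python
import statistics

def median_rule(scores):
    """Aggregate review scores by taking their median (as in [11])."""
    return statistics.median(scores)

def mean_rule(scores):
    """Aggregate review scores by taking their unweighted mean."""
    return statistics.mean(scores)

def weighted_mean_rule(scores, reputations):
    """Mean weighted by reviewer reputation: one possible reading of [13]."""
    return sum(s * w for s, w in zip(scores, reputations)) / sum(reputations)

def mean_with_dispersion(scores):
    """Mean complemented with the standard deviation of the individual
    scores, so that contested submissions can be flagged: one possible
    reading of [10]."""
    return statistics.mean(scores), statistics.stdev(scores)

# Three hypothetical reviewers scoring one submission on a 1-5 scale,
# with reputations on an arbitrary 0-1 scale.
scores = [4, 2, 5]
reputations = [0.9, 0.5, 0.7]
print(median_rule(scores))                      # 4
print(mean_rule(scores))                        # 3.666...
print(weighted_mean_rule(scores, reputations))  # 3.857...
print(mean_with_dispersion(scores))             # (3.666..., 1.527...)
```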

First Issue: Limited Model Integration

The existing literature consists of an abundance of competing assumptions, alternative implementations, and unconnected findings, resulting in a fragmented landscape. The literature on aggregation rules is a prime example: only a few papers have compared more than one aggregation rule (the examples above have), and none has attempted to implement aggregation rules proposed in previous work. This lack of integration among models, and of further development of existing models, raises important concerns about their generalizability.

We are addressing this issue by implementing the aggregation rules from the literature in our simulation model. By aligning these rules within a common simulation framework, we can test them against one another and identify which ones (and under what conditions) best maximize common outcome metrics of peer review, such as efficacy (i.e. the ability to filter out poor-quality submissions) and efficiency (i.e. reducing costs).
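As a hint of what such a comparison could look like, here is a toy harness, sketched under simplifying assumptions: the Gaussian quality and noise model, the number of proposals, and the funding rate are placeholders, not SFI parameters or our model's actual design.

```python
import random
import statistics

def simulate_call(rule, n_proposals=100, n_reviewers=3, noise=1.0,
                  n_funded=20, seed=0):
    """Toy comparison of aggregation rules on a single simulated call.

    Proposals have a latent 'true quality'; each reviewer observes it with
    Gaussian noise; the given rule aggregates the scores; the top-ranked
    proposals are funded. Returns the share of the truly best proposals
    that end up funded (a crude efficacy measure)."""
    rng = random.Random(seed)
    true_quality = [rng.gauss(0, 1) for _ in range(n_proposals)]
    scores = [[q + rng.gauss(0, noise) for _ in range(n_reviewers)]
              for q in true_quality]
    aggregated = [rule(s) for s in scores]
    funded = set(sorted(range(n_proposals),
                        key=lambda i: aggregated[i], reverse=True)[:n_funded])
    best = set(sorted(range(n_proposals),
                      key=lambda i: true_quality[i], reverse=True)[:n_funded])
    return len(funded & best) / n_funded

for rule in (statistics.mean, statistics.median):
    print(rule.__name__, simulate_call(rule))
```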

Second Issue: Limited Use of Empirical Calibration and Validation

Although many researchers have advocated for more empirically calibrated and validated models [7], few have incorporated empirical evidence on peer review in their work. With a few exceptions (e.g. [5]), models of peer review (and of aggregation rules in particular) have not been calibrated or validated.

There may be several reasons for this. One is the aforementioned scarcity of data on peer review systems. A second is that, even when data are available, they are often qualitative, and there is no consensus on good practices for using qualitative data sources to calibrate and validate models.

We argue that both qualitative and quantitative evidence are necessary for the study of peer review: such data are needed to understand the formal rules and the actual practices of the peer review process. Furthermore, quantitative data can be deployed to test models' predictions empirically [6].

In our study of aggregation rules, we can rely on qualitative and quantitative data on the peer review process in two funding schemes at Science Foundation Ireland. We are using these data in two ways: first, to reproduce the conditions found in a real peer review process, and second, to test the effects of competing aggregation rules against empirically observable outcomes.

The data sources we have are at different levels of aggregation (funding calls, proposals, individual applicants and reviewers) and mix quantitative and qualitative components (e.g. call documents, instructions, textual reviews, interviews with applicants, and so on). The use of qualitative data sources in particular raises challenges that are of common interest to modelers who work with these data types.

Use of Qualitative Data Sources

The first challenge concerns the formalization of the model—that is, the initial phase of model building where the modeler translates an informal description of a process into a formal system. In our case, formalization means translating reviewer guidelines and SFI regulation and guidance documents into code for the agent-based model.
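By way of illustration only, a guideline such as "each proposal is scored by three reviewers on two criteria, with the first criterion counting for 60% of the overall score" could be formalized roughly as follows. The rule itself is invented for this example and is not taken from SFI documents.

```python
# Hypothetical formalization of a made-up guideline: reviewers score two
# criteria on a 1-5 scale, with criterion weights of 0.6 and 0.4.
CRITERION_WEIGHTS = (0.6, 0.4)

def proposal_score(reviews):
    """'reviews' is a list of (criterion_1, criterion_2) score pairs, one
    pair per reviewer. Each reviewer's overall score is the weighted sum of
    criterion scores; the proposal score is the mean across reviewers."""
    per_reviewer = [sum(w * s for w, s in zip(CRITERION_WEIGHTS, r))
                    for r in reviews]
    return sum(per_reviewer) / len(per_reviewer)

print(proposal_score([(5, 4), (3, 3), (4, 2)]))  # 3.6
```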

The literature offers few examples of protocols or methods for producing code from qualitative data. One example is the Engineering Agent Based Social Simulation framework (EABSS), which demonstrates how model development can be driven by a focus group [14]. Others have used interview data as a starting point: interviews can be used to draw cognitive maps, which are then implemented in the simulation environment to guide agents' behavior [3].

A second challenge concerns the use of qualitative data for the empirical validation of a simulation model. To our knowledge, the only way of testing numerically expressed model predictions against qualitative data (e.g. a report by the chair of an SFI sitting panel) is to convert the qualitative data into quantities. For textual inputs (like reviews) we can do the conversion with a combination of manual coding and computational methods (e.g. natural language processing)—this, at least, is the approach we are taking to translate textual reviews into input for the empirical calibration of a simulation model of aggregation rules.
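A stripped-down sketch of the manual-coding side of that conversion might look as follows; the coding scheme and its numeric mapping are hypothetical and merely stand in for an actual codebook.

```python
# Hypothetical mapping from manually assigned codes to numeric values.
CODE_TO_SCORE = {
    "strong endorsement": 5,
    "endorsement with reservations": 4,
    "mixed assessment": 3,
    "substantial concerns": 2,
    "rejection": 1,
}

def review_to_score(codes):
    """Average the numeric values of the codes a human coder assigned to
    one textual review (e.g. one code per evaluation criterion), yielding
    a quantity that can feed the calibration of the simulation model."""
    return sum(CODE_TO_SCORE[c] for c in codes) / len(codes)

print(review_to_score(["strong endorsement", "mixed assessment"]))  # 4.0
```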

Conclusion

By taking our ongoing study of aggregation rules in peer review as an example, we have illustrated two common gaps in the modeling literature, why it is important to address them, and how we can do so. One of the two gaps concerns the insufficient interface between models and the real world: we have argued that the use of mixed data sources can alleviate the issue, and we have summarized our strategies for incorporating diverse data sources into the empirical calibration and validation of a simulation model.

To conclude, our modeling work on the peer review process at SFI has two ambitious objectives: (1) to test competing assumptions and modeling strategies, and (2) to pioneer the integration of qualitative evidence into a simulation model, for which standards have yet to emerge.