1 Introduction

For 25 years, countries have been trying to negotiate agreements to limit global emissions of greenhouse gases, and yet all this time emissions have continued to increase. The 2009 Copenhagen conference invited countries to submit quantified, nationally determined emission reduction targets aimed at limiting mean global temperature change to 2 °C. However, the submissions made subsequently fell far short of the levels needed to meet this goal (Rogelj et al. 2010), and so countries decided to negotiate a new agreement. In subsequent conferences, countries were urged to submit pledges for emission reductions known as “intended nationally determined contributions,” to include a reference point for the emission target and a time frame for meeting it. As the negotiations advanced, it became clear that the new treaty’s main novel feature would be a procedure for pledge and peer review, and the agreement ultimately adopted in Paris retains this feature. Negotiators have long appreciated the need for monitoring and verification (Thompson 2006), but as Aldy (2014: 283) has noted, the review process adopted by the UNFCCC before Paris did “not include a formal peer review mechanism.” Paris moved the review process a step closer in this direction. 
“In order to build mutual trust and confidence,” Article 13 of the Agreement establishes a “transparency framework” for the “tracking” of a country’s “progress towards achieving [its] individual nationally determined contributions,” with the information supplied being subject to a “technical expert review,” the purpose of which is to determine whether a party has achieved its “nationally determined contributions” and to identify “areas of improvement for the Party….” Moreover, the agreement requires that each party “participate in a facilitative, multilateral consideration of progress with respect to… implementation and achievement of its nationally determined contribution.” Article 14 goes on to say that parties shall also “periodically take stock of the implementation of [the] Agreement to assess [their] collective progress towards achieving” the 2 °C goal.

Will the new agreement work any better than the approaches tried previously? The end-dates for the intended nationally determined contributions declared in Paris occur in 2025 and 2030, so it will take a decade or more to know whether the pledges made there are actually fulfilled. Even then, estimation of the effect of the new agreement will be difficult, since we will never be able to observe the “counterfactual”—the emissions that would have come about had Paris never been negotiated. In other contexts, humans have been shown to be sensitive to social feedback, even when the feedback does not involve a direct cost (Masclet et al. 2003; López-Pérez and Vorsatz 2010; see the electronic supplementary material for a review). However, the climate problem differs in important ways from the settings studied previously.

Here we report the results of a new experiment designed to capture key features of the climate problem and the design of the new Paris Agreement. First, the players in our experiment can choose between “cheap” cooperation and “expensive” cooperation—a crucial distinction, as only expensive cooperation can stabilize concentrations and so reach the Paris Agreement’s collective goal of “[h]olding the increase in the global average temperature to well below 2 °C above pre-industrial levels.” Second, our game is played not only between individuals within a group but between the group and Nature. To avert a “catastrophic” outcome, players must undertake “expensive” cooperation, and the more they contribute collectively, the more they reduce the probability of triggering a “dangerous” outcome. Third, the players in our game choose more than their contributions. They also choose their collective goal (a value that is akin to the “carbon budget” associated with the 2 °C goal), which relates to the game they are playing against Nature (avoiding “dangerous” climate change), and their individual “pledges,” which are represented in our experiment by non-binding declarations about a player’s intentions to contribute in the future. A “review” in our experiment represents a judgment made by other players about an individual player’s behavior. The process of “pledge and review” in our game allows players to be “judged” for both their “ambition” (their individual pledges relative to the collective goal) and for their contributions relative to their pledges. Transparency about contributions is thus critical to the process of peer review. Finally, our experiment explores the implications of varying the timing at which a review takes place. 
Timing was clearly considered to be important to some negotiators, as an earlier draft of the treaty distinguished between an “ex ante” review, conducted after pledges had been submitted but before contributions were made, and a “strategic” review, undertaken after contributions had been made (Footnote 1). The final agreement uses different language but still emphasizes the need for “tracking of progress.”

Though a laboratory experiment obviously cannot tell us what will actually happen in the wake of Paris, it can provide a comparison between a situation without a review process and situations with a review process, and so can show whether a process of review causes proposals, pledges, and, most important of all, actual contributions to increase.

2 Experimental design

Our analysis is based on a laboratory experiment of a game played by groups of five players. In this game, every player is endowed with 5 black poker chips worth €.10 each and 15 red poker chips worth €1.00 each. Hence, the group has 100 chips overall (25 black chips and 75 red chips), and both types of chip can be invested to “mitigate climate change.” We can think of the black chips as a low-cost technology for “ordinary abatement” and the red chips as a high-cost technology for removing carbon dioxide from the atmosphere (Keith 2009). Each chip contributed, of either type and by any player, gives every player in the group a return of €.05. This is the marginal benefit of avoiding “gradual” climate change. The game also involves “catastrophic” climate change, the avoidance of which is feasible but requires using both the low- and the high-cost technologies. If 50 or fewer chips are contributed overall, a threshold will be crossed, causing each player to lose €20. If the group contributes more than 50 chips, the probability of crossing the threshold declines linearly as more and more chips are contributed, reaching zero if and when all of the group members contribute all of their chips. To have any chance of avoiding “catastrophe,” the players must thus contribute expensive chips and not only their cheap chips. This game design makes contributing chips a prisoners’ dilemma: from the group’s perspective, it is best for everyone to contribute all of their chips, but from any individual’s perspective it is best to keep all of his or her chips (Footnote 2).
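The incentive structure described above can be made concrete with a little arithmetic. The following Python sketch is illustrative only and is not the authors' software: the function and constant names are our own, and the formula (100 − G)/50 for group totals G above 50 is our reading of a probability that "declines linearly" and reaches zero at 100 chips.

```python
# Illustrative sketch. Names and the exact linear formula are assumptions
# inferred from the description in the text, not the authors' code.

RETURN_PER_CHIP = 0.05   # euros paid to every group member per chip contributed
LOSS = 20.0              # euros each player loses if the threshold is crossed
GROUP_SIZE = 5
TOTAL_CHIPS = 100        # 25 black + 75 red chips per group
SAFE_FLOOR = 50          # at or below this total, the loss is certain


def catastrophe_probability(total_contributed):
    """Probability that the group crosses the threshold (assumed linear)."""
    if not 0 <= total_contributed <= TOTAL_CHIPS:
        raise ValueError("total must lie between 0 and 100 chips")
    if total_contributed <= SAFE_FLOOR:
        return 1.0
    return (TOTAL_CHIPS - total_contributed) / (TOTAL_CHIPS - SAFE_FLOOR)


def marginal_benefit_above_floor():
    """Expected gain in euros from one extra chip once the total exceeds 50.

    Returns a pair: (gain to the contributor, gain to the whole group).
    """
    drop_in_risk = 1 / (TOTAL_CHIPS - SAFE_FLOOR)        # 0.02 per extra chip
    private = RETURN_PER_CHIP + drop_in_risk * LOSS      # direct return + avoided loss
    social = GROUP_SIZE * private                        # same gain for all five players
    return private, social


if __name__ == "__main__":
    for total in (40, 50, 60, 75, 100):
        print(total, catastrophe_probability(total))
    print(marginal_benefit_above_floor())
```

Under these assumptions, one extra chip above the 50-chip mark is worth about €0.45 in expectation to the player who contributes it and about €2.25 to the group as a whole. A red chip costs its owner €1.00, so contributing it pays off for the group but not for the individual, which is the dilemma the text describes.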

The experiment was presented in a neutral frame as regards context and language to avoid any potential bias; there was no mention of “climate change,” “cooperation,” or “catastrophe” (instructions can be found in the SI). The game was played in stages. First, individuals made “proposals” for a group target knowing that the median value would be selected as the group’s “target.” Second, each player pledged an amount he or she intended to contribute subsequently. Third, the players made their actual contributions over two stages. It was common knowledge that the targets and pledges were non-binding and that all values would be revealed to every member of the group after each stage. Because players were allowed to contribute over two stages, players could see how much their co-players had contributed in the first stage before deciding how much to contribute in the second stage.
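The target-setting rule in the first stage can be written in a few lines. This is a minimal sketch with hypothetical proposals; the function name is our own and is not taken from the experimental software.

```python
from statistics import median


def group_target(proposals):
    """Return the group target: the median of the five individual proposals."""
    if len(proposals) != 5:
        raise ValueError("the game is played in groups of five")
    return median(proposals)


if __name__ == "__main__":
    # Hypothetical proposals (in chips) from players A to E.
    proposals = [100, 60, 80, 100, 50]
    print(group_target(proposals))
```

Because each group has five members, the median is always one of the submitted proposals, and a single extreme proposal has only limited influence on the target.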

The game just described represents the No-Review treatment in which the players lack an explicit mechanism for expressing their judgment about other players’ behavior. There were also three treatments that incorporated an explicit review process (see Fig. 1). In each of these treatments, every player “graded” all of the other players plus himself or herself. Grades were on a scale from 1 to 6, with a grade of 1 being “very good” and 6 “insufficient.” (Our experiment was conducted in Germany, and German students are familiar with this grading scale from their high school days.) After the grades had been submitted, the average grade given to every player was revealed to the group. The grades that the players gave to themselves were not revealed publicly. This grading scheme did not affect payoffs directly, but it did provide a vehicle for “peer review” by allowing the players to signal their approval or disapproval of the choices made by the members of their group. In the Ex-Ante-Review treatment, the review was done after the pledges but before the contributions were made. In the Mid-Point-Review treatment, the review was done between the first and second contribution stages. Finally, in the Ex-Post-Review treatment, the review came after the second contribution stage.
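The feedback that players saw in the review treatments can be sketched as follows. This is a hedged illustration rather than the authors' implementation: we assume that the published average for each player is taken over the grades given by the other group members only, since self-grades were not revealed, and the data layout and function name are hypothetical.

```python
def average_received_grades(grades):
    """Average peer grade per player on the 1 (very good) to 6 (insufficient) scale.

    grades[giver][target] is the grade that `giver` assigned to `target`.
    Assumption: self-grades (giver == target) are excluded from the average.
    """
    players = sorted(grades)
    result = {}
    for target in players:
        peer_grades = [grades[giver][target] for giver in players if giver != target]
        result[target] = sum(peer_grades) / len(peer_grades)
    return result


if __name__ == "__main__":
    # Hypothetical three-player example for brevity; the experiment used five.
    grades = {
        "A": {"A": 1, "B": 3, "C": 4},
        "B": {"A": 2, "B": 1, "C": 5},
        "C": {"A": 2, "B": 3, "C": 1},
    }
    print(average_received_grades(grades))
```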

Fig. 1 Timeline for the experiment

In this game, the incentive to contribute red chips depends very much on players’ expectations or “beliefs” about how many red chips their co-players will contribute. This is because contributions of red chips are very costly to individuals and even inefficient for the group so long as no more than 50 chips are contributed in total, and at least some players must contribute red chips in addition to black chips in order for the group contribution to top 50. To obtain an estimate of each player’s expectations, players were asked, just before contributions were chosen, to guess how many chips their co-players would contribute on average. To ensure that these estimates reflected their actual expectations, players were given a reward of €1 for correct guesses (meaning guesses that were within a range defined by the actual mean plus or minus 1).
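The incentive for truthful guesses can be expressed as a simple rule. The sketch below is a minimal illustration with hypothetical numbers; the function name is ours, and treating a guess exactly one chip away from the mean as correct is an assumption, since the text does not say how the boundary was handled.

```python
def belief_reward(guess, co_player_contributions, tolerance=1.0, prize=1.0):
    """Return the reward in euros for a guess about co-players' mean contribution.

    Assumption: the range "actual mean plus or minus 1" includes its endpoints.
    """
    actual_mean = sum(co_player_contributions) / len(co_player_contributions)
    return prize if abs(guess - actual_mean) <= tolerance else 0.0


if __name__ == "__main__":
    # Hypothetical contributions (in chips) by a player's four co-players.
    others = [10, 12, 14, 15]        # mean is 12.75
    print(belief_reward(12, others))
    print(belief_reward(10, others))
```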

In addition to the 20 poker chips, each player was given an “endowment fund” of €19 to ensure that he or she could not be left out of pocket (Footnote 3). The endowment fund could not be used to purchase chips, and so we can think of it as representing a country’s “capital stock,” a resource that cannot be used to mitigate climate change but that would be at risk should dangerous climate change occur. Given this fund, a player’s worst possible payoff in the experiment was €0, and her best possible payoff was €38.50. The full cooperative payoff to each player was €24 and the Nash equilibrium payoff was €14.50. After the game was played, the participants were asked to complete a follow-up questionnaire. After that, the threshold was determined by the randomized spin of a computer wheel, with the “ends” set at 50 and 100. The wheel, representing Nature, determined whether, for those groups contributing between 50 and 100 chips, the players would lose the €20. Depending on the outcome of the spin, the players were then given their final payout in cash.
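The four benchmark payoffs quoted above follow from the chip values, the €.05 return per contributed chip, the €19 fund, and the €20 loss. The Python sketch below is a minimal reconstruction, not the authors' software. The function and constant names are our own, and the contribution profiles used for the best and worst cases are our inference from the figures in the text.

```python
# Illustrative reconstruction of a player's payoff in euros.
# All names are hypothetical; the parameter values come from the text.

BLACK_VALUE, BLACK_CHIPS = 0.10, 5
RED_VALUE, RED_CHIPS = 1.00, 15
RETURN_PER_CHIP = 0.05
ENDOWMENT_FUND = 19.0
LOSS = 20.0


def payoff(black_given, red_given, group_total, catastrophe):
    """Payoff to one player.

    black_given, red_given: chips this player contributed.
    group_total: chips contributed by the whole group, this player included.
    catastrophe: True if the threshold was crossed.
    """
    kept = (BLACK_VALUE * (BLACK_CHIPS - black_given)
            + RED_VALUE * (RED_CHIPS - red_given))
    shared = RETURN_PER_CHIP * group_total
    loss = LOSS if catastrophe else 0.0
    return round(kept + shared + ENDOWMENT_FUND - loss, 2)


if __name__ == "__main__":
    # Nobody contributes, so the loss is certain (Nash equilibrium payoff).
    print(payoff(0, 0, 0, True))
    # Everyone contributes everything, so the loss is avoided (full cooperation).
    print(payoff(5, 15, 100, False))
    # One player keeps all chips, the other four give all 80 of theirs,
    # and the threshold happens not to be crossed (best case).
    print(payoff(0, 0, 80, False))
    # One player gives all 20 chips and nobody else contributes (worst case).
    print(payoff(5, 15, 20, True))
```

Under these assumptions the four calls reproduce the payoffs of €14.50, €24, €38.50, and €0 reported in the text.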

The experimental sessions were held in a computer lab at the University of Magdeburg, using undergraduate students recruited from the general student population. In total, 195 students participated in the experiment, each student taking part in one treatment only. In each session, 20 or 25 subjects were seated at linked computers (game software z-Tree; see Fischbacher 2007) and randomly assigned to five-person groups (Footnote 4). Throughout the game, each player was identified by a different letter, from A to E. The experimental instructions handed out to the students included several numerical examples and control questions. The control questions tested subjects’ understanding of the game to ensure that they were aware of the available strategies and the implications of making different choices. After reading the instructions and answering the control questions correctly, every subject first played the game in three practice rounds. It was common knowledge that the composition of every group would be changed between these rounds. It was also common knowledge that group composition would be changed again before the game was played for real.

3 Results

Figure 2 presents mean values for the targets, pledges, and contributions. For every treatment, the mean target exceeds the mean pledge, which in turn exceeds the mean contribution. In short, individual pledges fell below the group target and contributions fell below the pledges (Footnote 5). Figure 2 also shows that the mean values for targets, pledges, and contributions are higher for the three review treatments compared to the control without review, but the differences are generally small. Statistical analyses of these data (see our electronic supplementary materials) show that the differences in targets between No-Review and Ex-Ante-Review and between No-Review and Ex-Post-Review are significant (Mann-Whitney-Wilcoxon rank-sum test, MWW; P < .05 each). Moreover, the differences in pledges between No-Review and the three review treatments are at least weakly significant (MWW test, P < .10 each). The differences in contributions between No-Review and Ex-Ante-Review as well as between No-Review and Mid-Point-Review are not statistically significant (MWW test, P > .30 each).

Fig. 2 Group averages for targets, pledges, and contributions by treatment

The largest aggregate contributions are found in the Ex-Post-Review treatment. For this treatment, the average contribution is 19 % higher than in No-Review, and this difference is on the borderline of statistical significance (MWW test, P = .112) (Footnote 6). This means that, on average, the probability of catastrophe decreases from 84 % in No-Review to 62 % in Ex-Post-Review. Figure 3 shows both the distribution of group contributions and the median value. In Ex-Post-Review, the median is above 70 while in the remaining three treatments it is around 60.

Fig. 3 Distribution of group contributions by treatment

Regression analysis reveals the critical chain of causality that underpins the effects of the review process (Table 1). The review process increases individual proposals for the group targets (and, hence, group targets) directly, with the effect being statistically significant for the Ex-Ante-Review and Ex-Post-Review treatments. The review process does not have any other direct effects, but it does have indirect effects. First, the review process increases pledges indirectly, because pledges increase with the group target. Second, the review process increases players’ expectations for how much their co-players will contribute, as these expectations depend on the pledges made by other players (which in turn depend on the group targets). Finally, the review process increases contributions indirectly. It does this, first, by increasing pledges (which depend on targets), as people who pledge more tend to contribute more; and, second, by increasing players’ expectations for how much their co-players will contribute, as contributions increase in these expectations. The review process does not affect contributions directly.

Table 1 Linear regressions of individual proposals, pledges, beliefs, and contributions

Figure 4 arranges all groups according to their contribution level, from lowest to highest. It also shows the corresponding group values for pledges and expectations. All groups with high contributions have high pledges and expectations. Groups with low pledges and expectations tend to have low contributions. However, not all groups with high pledges and expectations have high contributions. High pledges and high expectations thus appear to be necessary but not sufficient for high contributions.

Fig. 4 Total of pledges, expectations, and contributions

The behavior of individuals mirrors these observations about groups. Figure 5 shows that, with one exception (in the Ex-Ante-Review treatment), individuals who pledged to give a low contribution gave a low contribution, but that the players who pledged high sometimes gave a high contribution and sometimes gave a low contribution. Similarly, Fig. 6 shows that players with low expectations tended to contribute low, but that the players with high expectations sometimes contributed low and sometimes contributed high. As observed for group behavior, high pledges and expectations are a necessary condition for high contributions by individuals, but they are not sufficient.

Fig. 5 Individual pledges and contributions

Fig. 6 Beliefs and contributions

Table 2 shows the grades that players on average gave to their co-players and to themselves. Average grades were better when the reviews were given earlier rather than later in the process (worsening from 2.1 in Ex-Ante-Review to 3.3 in Mid-Point-Review to 3.6 in Ex-Post-Review; remember that higher values imply a worse grade), arguably because things generally looked better earlier in the game. The grades that subjects gave to themselves were better than the grades they received from their co-players, implying that the players applied different standards to themselves than to their co-players. A plausible explanation for this is “self-serving bias,” a tendency for people to perceive themselves in a more positive light than others do (Baumeister 1998). However, we do not find evidence that the difference between peer and self-assessment has any effect on behavior. Subjects who gave themselves a grade that was much better than the grade given to them by their peers did not behave differently in subsequent stages than the subjects who gave themselves a grade that was closer to the one given to them by their peers.

Table 2 Grades

Regression analysis (Table 3) shows that higher pledges cause players to be given a better grade only in the Ex-Ante-Review treatment, where players have no other information to go on. In the Mid-Point-Review treatment, a player’s grade is affected only by her first-stage contribution, not her pledge, indicating that players care about actions, not words. Finally, in the Ex-Post-Review treatment, a player’s grade is significantly affected by his first- and second-stage contributions as well as by his pledge. In this case, however, the coefficient on pledges has the opposite sign compared with the Ex-Ante-Review treatment. This is because, in the Ex-Post-Review treatment, a player’s peers can see whether his contributions correspond to his pledge. The data show that people who pledged to make a low contribution tended to contribute very little, but that people who pledged to make a high contribution sometimes contributed very few chips. People who gave high pledges were thus graded down because their contributions often fell short of their pledges.

Table 3 Linear regressions of average received grades

We observe a remarkably high variation in group contributions in all treatments, with totals ranging from 35 to 78 in No-Review, 54 to 85 in Ex-Ante-Review, 30 to 92 in Mid-Point-Review, and 25 to 95 in Ex-Post-Review (see Fig. 3). Some groups contributed so little as to make “catastrophe” inevitable, whereas other groups contributed so much that the risk of “catastrophe” was remote.

To explore this variation in group behavior more systematically, we divided the groups into three categories (Table 4). “Successful” groups (11 in total) contributed at least 75 chips in total; these groups had a better than even chance of avoiding catastrophe. “Intermediate” groups (22) contributed more than 50 but fewer than 75 chips; these groups were more likely than not to trigger “catastrophe.” Finally, “unsuccessful” groups (6) contributed 50 or fewer chips; these groups were sure to trigger “catastrophe.” Focusing on the contrast between the successful and unsuccessful groups, the successful groups chose a higher target (MWW test, P = .0432), pledged to contribute more (MWW test, P = .0180), had higher expectations about other players’ contributions (MWW test, P = .0025), and made higher first-stage contributions (MWW test, P = .0009).
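The three categories amount to a simple rule on a group's total contribution. The sketch below shows one way to apply it; the function names and the list of group totals are hypothetical and are not the experimental data.

```python
from collections import Counter


def classify_group(total_contributed):
    """Assign a group to one of the three performance categories."""
    if total_contributed >= 75:
        return "successful"
    if total_contributed > 50:
        return "intermediate"
    return "unsuccessful"


def tally(group_totals):
    """Count how many groups fall into each category."""
    return Counter(classify_group(total) for total in group_totals)


if __name__ == "__main__":
    # Hypothetical group totals in chips, for illustration only.
    totals = [35, 50, 54, 62, 70, 75, 85, 95]
    print(tally(totals))
```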

Table 4 Comparison between groups with different performance

To further explore the effect of group composition, we defined “free riders” as players who contributed five or fewer chips in the first stage. In the successful groups, free riding was rare. Only one group had a free rider; the average across all these groups was just 0.09. In the unsuccessful groups, the average number of free riders was much higher (1.66), with some groups having as many as three free riders. The difference in free riding between the successful and unsuccessful groups was highly significant (MWW test, P = .0004). It thus seems that the presence of one or two “free riders” virtually guarantees a bad overall outcome. This is not only because the free riders fail to contribute. It is also because the behavior of the free riders causes the conditional cooperators to reduce their contributions in the second stage.
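The free-rider definition is easy to operationalize. The following sketch counts free riders per group and averages the count across groups; the function names and the first-stage contributions are hypothetical and serve only to illustrate the definition.

```python
FREE_RIDER_CUTOFF = 5  # five or fewer chips in the first stage, as defined in the text


def count_free_riders(first_stage_contributions):
    """Number of players in one group who gave five or fewer chips in stage one."""
    return sum(1 for chips in first_stage_contributions if chips <= FREE_RIDER_CUTOFF)


def mean_free_riders(groups):
    """Average number of free riders per group across a list of groups."""
    return sum(count_free_riders(group) for group in groups) / len(groups)


if __name__ == "__main__":
    # Hypothetical first-stage contributions for two five-player groups.
    groups = [
        [0, 5, 6, 12, 20],     # two free riders
        [8, 10, 10, 12, 15],   # none
    ]
    print([count_free_riders(group) for group in groups])
    print(mean_free_riders(groups))
```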

There is some controversy as to whether the 2 °C goal first endorsed by the parties to the Framework Convention in Copenhagen and Cancun is “scientifically meaningful,” let alone achievable (Victor and Kennel 2014). In our experiment, targets expressed in terms of a group’s total contribution of chips are indisputably meaningful, and the ones actually chosen are all achievable by design. The goal agreed in Paris, to limit mean global temperature change “to well below 2 °C above pre-industrial levels,” is even more ambitious than the earlier one, but does it herald stronger future collective action? In our experiment, groups that chose higher targets tended to contribute more. Groups that chose the maximum target of 100 chips contributed on average 70 chips, whereas groups that chose a lower target contributed just 59 chips on average. However, comparison of the “successful” and “unsuccessful” groups shows that the group target is only one of multiple preconditions for successful action. Successful groups (the ones that have a better than even chance of avoiding “catastrophe”) not only chose an ambitious target; their members also made equally ambitious pledges, had positive expectations for how much their co-players would contribute, and undertook substantial early action. Whether the ambitious goal agreed in Paris turns out to be a harbinger of substantial future emission reductions may thus depend on whether it raises expectations for national action and whether countries fulfill these expectations by taking visible strong action over the next few years.

4 Conclusions

As in our experiment, analyses of the contributions pledged in the run-up to the Paris conference predict that they will fall short of achieving the 2 °C goal chosen by the same group of countries (International Energy Agency 2015; UNFCCC Secretariat 2015). Actual contributions may even come in below pledges, as happened in our experiment.

Of course, our experiment focused on only one particular aspect of the pledge and review mechanism, namely its potential to change behavior under perfect information about players’ contributions. Whether the pledge and review mechanism adopted in Paris will turn out to be effective may also depend on factors we did not consider in our experiment, such as transparency, public attention, and comparability. However, it is not obvious that a consideration of other factors will favor cooperation. Unlike in our experiment, the pledges submitted by countries in the run-up to Paris were expressed in different terms (total emissions, emissions intensity, emissions relative to business as usual, emissions with or without international offsets, and so on), making it difficult to know whether similar countries are pledging to make similar sacrifices (Aldy and Pizer 2014). Also, countries may be more interested in a country’s effort, which is imperfectly correlated with its emissions, whereas in our experiment effort and contributions are equivalent. Cooperation by more than five countries will be needed to stabilize atmospheric concentrations of greenhouse gases, and free rider incentives generally increase with group size. Finally, efforts by a subset of countries to limit emissions may be further undermined by “globalization.” For example, should only a sub-group of countries limit its emissions, market prices, including energy prices, will change, causing production of greenhouse gas intensive goods to shift towards the countries that do not limit their emissions. Similarly, the drop in fossil fuel prices brought about by a sub-group’s efforts to limit emissions may increase the amount of fossil fuels consumed by other countries. Both of these responses lead to “leakage” (Felder and Rutherford 1993). Future research may show how a pledge and review mechanism fares under these alternative conditions.

We find that the pledge and review process may lead to small increases in contributions, and we find no evidence that it is harmful to cooperation. Other kinds of non-binding institutions have been found to undermine cooperation by adding another source of frustration to the game (Dannenberg 2016). The implication of our research is thus not that the pledge and review mechanism should be replaced, but that it should be combined with other measures.

Our results for “successful” and “unsuccessful” groups might seem to suggest that conditional cooperators would do better by shunning the “free riders” and forming a “club” of their own, that is, a group of like-minded countries that can deny non-members the benefits of the club members’ actions (Keohane and Victor 2011). However, emission reductions are a global public good, and no country can be excluded from benefiting from the emission reductions achieved by club members. To limit climate change, clubs must therefore focus on something like cooperation in the development of a new technology or special trade arrangements, and then leverage the supply of this “good” for the purpose of getting all countries to limit emissions (Nordhaus 2015). A related but somewhat different approach emphasizes the need for agreements to focus on choices involving individual gases and sectors that facilitate coordination, with conditional cooperators offering a combination of sticks and carrots to broaden participation and increase contributions (Barrett 2003). As our experiment shows that it would be imprudent to rely solely on a review process to change the behavior of free riders, these approaches deserve more serious consideration. The priority, we believe, should be to develop coordination agreements, including the effort already underway to negotiate an amendment to limit HFCs in the Montreal Protocol, as these would be complementary to the Paris Agreement.