1 Introduction

Achieving cooperation in group settings where individual and collective interests conflict poses a longstanding challenge. Individuals are incentivized to act in their own self-interest rather than cooperating for the collective good, which can lead to detrimental outcomes like the tragedy of the commons. Effective cooperation requires aligning individual and group interests through mechanisms like incentives, penalties for defection, communication, trust-building, and reputation systems. However, this remains difficult, as individuals often find ways to free-ride if acting selfishly benefits them more than cooperating. Examples include overfishing, public goods provision, and reducing carbon emissions.

Public goods experiments using the voluntary contribution mechanism (VCM) are a classic way to study cooperation in these social dilemma situations. In a typical VCM experiment, subjects are matched in small groups and contribute towards a non-excludable public good over multiple periods. When the group size is \(n\) and a one-unit contribution to the group account increases each member’s earnings by \(p\), with \(p<1\) and \(np>1\), the Nash equilibrium is zero contribution, whereas the efficient outcome is full contribution. Results from VCM experiments show that individuals generally start with some positive contribution, which decays over time in the absence of incentives (Isaac & Walker, 1988).
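The equilibrium logic can be made explicit with a standard linear payoff function. The following is a sketch under assumed notation: the endowment \(e\), the contribution \(c_i\), and the payoff \(\pi_i\) are symbols introduced here for illustration, while \(n\) and \(p\) are as defined in the preceding paragraph.

```latex
\begin{align}
\pi_i &= e - c_i + p \sum_{j=1}^{n} c_j, \qquad 0 \le c_i \le e, \\
\frac{\partial \pi_i}{\partial c_i} &= p - 1 < 0, \\
\frac{\partial}{\partial c_i} \sum_{k=1}^{n} \pi_k &= np - 1 > 0.
\end{align}
```

Under this specification, each token contributed lowers the contributor's own payoff by \(1-p\) but raises the group's total payoff by \(np-1\), which is why zero contribution is individually optimal while full contribution is efficient.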

In their seminal studies, Fehr and Gächter (2000, 2002) implement a design where subjects play a game of known length, observe each other’s contributions each period, and decide how much to deduct from the earnings of other group members, i.e., sanction each other. This deduction is costly for the subject; therefore, it is classified as a form of strong reciprocity (costly altruism). When subjects are allowed to assign punishment points to their group members, this can facilitate cooperation and keep the contribution levels high in the experiment. Herrmann et al. (2008) extended this approach by conducting a study comparing the effectiveness of punishment in sustaining cooperation across 16 cities worldwide. They found that punishment successfully increased cooperation in Western cities like Boston, Copenhagen, St. Gallen, Zurich, and Nottingham. However, it failed to do so in Istanbul and other similar cities. Unlike Western subjects, Istanbul participants exhibited more antisocial punishment, where high contributors were targeted. Overall, punishment had low effectiveness in these cities.

Our study revisits cooperation and punishment in Istanbul using the same subject pool but a between-subject design, in contrast to the within-subject design used by Herrmann et al. (2008). In their study, all subjects first played the VCM experiment without punishment and then with punishment. In our study, subjects participated in either the game without punishment (N-experiment) or with punishment (P-experiment), but not both. Under this between-subject approach, we find substantially higher contributions in Istanbul when punishment is introduced compared to no punishment. However, average contributions remain below the levels observed in Western cities.

We identify two critical factors behind cooperation outcomes in the P-experiment:

  (1) The distribution of first-period contributions exhibits significant heterogeneity across groups. Groups starting at high initial cooperation sustain those levels, while groups starting at low cooperation remain low throughout.

  (2) Subjects follow simple contribution rules based on prior contributions and received punishment. We estimate these linear rules from the data.

The interaction of these two factors generates strong persistence in contributions over time. To demonstrate this, we conduct a counterfactual experiment using an agent-based model. Feeding the estimated decision rules and actual first-period data accurately reproduces the evolution of contributions over the course of the game.

Analyzing data from Herrmann et al. (2008), we find similar first-period heterogeneity and contribution persistence over time in cities resembling Istanbul. These stylized patterns are common across cities that share socio-economic similarities with Istanbul, namely Athens (Greece), Dnipropetrovsk (Ukraine), Samara (Russia), Minsk (Belarus), Riyadh (Saudi Arabia), and Muscat (Oman). Our Istanbul results likely extend to these cities under a between-subject design.

Our findings suggest that accounting for heterogeneous initial cooperation, rather than just city-level averages, is critical for valid cross-society comparisons. The decay induced by a prior N-experiment in a within-subject design masks actual cooperation capacity. More broadly, populations exhibiting diversity in initial cooperation may see larger gains from institutions designed to enhance cooperation. This study isolates the effect of experimental design and highlights the pivotal role of initial conditions for cooperation outcomes. It provides guidance for designing future cross-society cooperation experiments to produce desirable results.

The rest of the paper is organized as follows: in Sect. 2, we describe the experiment. In Sect. 3, we report our empirical findings and discuss salient patterns that guide contribution and sanction decision rules. In Sect. 4, we propose and report the results of our agent-based model motivated by the estimated decision rules based on prior contributions and sanctions. In Sect. 5, we compare our findings in detail with those by Herrmann et al. (2008), and Sect. 6 concludes.

2 Experimental design and procedures

The experiment is based on the design by Fehr and Gächter (2000) and involves two treatments. In the N-experiment, subjects are randomly matched into groups of 4 and interact within the same group for 10 periods. Each period, subjects receive an endowment of 20 tokens, from which they can contribute to a “group project”. For every token invested in the group project, each group member earns 0.4 tokens. Subjects’ period earnings are calculated as the sum of earnings from the group project and the part of the endowment not invested in the group project. The P-experiment builds on the N-experiment but adds a punishment stage. After observing group members’ contributions (but not their identities), subjects can assign costly punishment points. Each point assigned costs the subject 1 token and reduces the target’s earnings by 3 tokens. The total reduction in a subject’s earnings is limited to the earnings from the contribution stage. We discuss further details of the two stages in the Appendix.
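The payoff rules of the two stages can be sketched in a few lines of Python. This is a minimal illustration, not the experimental software: the parameter values come from the description above, while the function names and the exact treatment of the cap (the reduction is capped at stage-one earnings, and the punisher's own cost is always deducted) are assumptions made here.

```python
ENDOWMENT = 20      # tokens per period (from the text)
MPCR = 0.4          # return to each member per token in the group project
PUNISH_COST = 1     # cost to the punisher per point assigned
PUNISH_IMPACT = 3   # reduction in the target's earnings per point


def contribution_stage(contributions):
    """Earnings of each group member before any punishment."""
    project_return = MPCR * sum(contributions)
    return [ENDOWMENT - c + project_return for c in contributions]


def punishment_stage(stage_one, points):
    """Apply punishment; points[i][j] is what member i assigns to member j.

    Assumption: the reduction is capped at the target's stage-one earnings,
    while the cost of points a member assigns is always deducted.
    """
    n = len(stage_one)
    earnings = []
    for j in range(n):
        received = sum(points[i][j] for i in range(n) if i != j)
        assigned = sum(points[j][k] for k in range(n) if k != j)
        reduction = min(PUNISH_IMPACT * received, stage_one[j])
        earnings.append(stage_one[j] - reduction - PUNISH_COST * assigned)
    return earnings
```

For example, with contributions of 0, 20, 20, and 20 tokens, the free rider earns 44 tokens in the contribution stage and each full contributor earns 24; if one contributor then assigns 2 points to the free rider, their earnings become 38 and 22 tokens, respectively.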

3 Results

We provide summary statistics for the observed contribution levels in Table 1. The initial contributions in our study are similar in both treatments, starting around 9 tokens. However, by the 10th period, average contributions decline to 2.85 tokens in the N-experiment, whereas they rise moderately to about 12 tokens in the P-experiment. Overall, contributions in the P-experiment are significantly higher than in the N-experiment (\(p=0.019\)), even though contributions in the first period do not differ significantly between the two experiments (\(p=0.950\)). Non-parametric testing reveals sufficient statistical power (\(pw = 0.78\)) to detect significant differences in average contributions between the N and P-experiments across all periods.

Herrmann et al. (2008) employ the design by Fehr and Gächter (2000) to measure the performance of the punishment mechanism among 1120 subjects in 16 different cities worldwide, including Istanbul. We report the contributions of subjects in Istanbul, Boston, and Copenhagen from that study in Table 1.

Using data from Istanbul, Herrmann et al. (2008) find a difference between the two treatments in Period 1, with lower contributions in the P-experiment than the N-experiment. However, they find no significant difference when considering all periods. Our summary statistics and non-parametric test results suggest that a between-subject design increases the effectiveness of punishment in sustaining cooperation in Istanbul. However, compared to results from Boston and Copenhagen, subjects in Istanbul still contribute at lower rates on average, even with punishment. Additionally, compared to those cities, the availability of punishment leads to lower average earnings in Istanbul, both in the current study and Herrmann et al. (2008).

Table 1 Mean contributions and earnings

We next report and discuss the results from our two treatments.

3.1 N-experiment

Contributions in the N-experiment exhibit the typical decay pattern seen in public goods games. We present the evolution of contributions averaged over groups for the N and P-experiments in Fig. 1. The frequency of zero contributions in the N-experiment is 27% in the first half but increases to 46% in the second half. Of the 15 groups, 11 end up at very low contribution levels (3 tokens or less), 3 at moderate levels (6.75 to 8.5 tokens), and only 1 reaches 12 tokens by period 10. While these values point to some degree of heterogeneity, the dominant trend in the N-experiment is a clear decline in contributions over time. This contrasts with the evolution of contributions in the P-experiment, as discussed in detail in Sect. 3.2.

Fig. 1 Timeline of average contribution by treatment. Notes: The left and right panels display the average contribution in the N-experiment and P-experiment, respectively

3.2 P-experiment

The P-experiment exhibits two salient patterns based on initial average contributions within each group, as illustrated in Fig. 2. First, we observe that group members update their contributions towards the group average. When a group member observes that her contribution falls short of the group average, she seldom decreases her contribution in the next period. Similarly, when her contribution exceeds the group average, her next-period contribution is frequently either lower or the same. Second, we observe that punishment decisions are often social. Subjects are more likely to punish those who contribute less than them, with the likelihood and amount of punishment increasing as the difference in contributions grows. Further, the amount of punishment is negatively related to the contribution of the target subject and positively related to the average contribution of the remaining group members. We report the quantitative details of these empirical patterns in the Appendix.

Fig. 2 Evolution of average contributions in the P-experiment. Notes: The left panel shows the average contribution over 10 periods for the 7 groups with the lowest first-period average contributions, while the right panel shows the same statistic for the remaining 8 groups

We next propose an agent-based model that combines the salient empirical decision patterns of our experiment with first-period contributions.

4 Agent-based modeling

In addition to the between-subject design, initial heterogeneity across groups emerges as a decisive factor affecting cooperation in the P-experiment. Our findings show that if subjects in Istanbul start off collaborating, they sustain high contribution levels comparable to Boston and Copenhagen. Figure 2, which clusters groups by their average contribution in the initial round, highlights the importance of the first-period contributions. The seven groups with low starting contributions are on the left, while the eight groups with high starting contributions are on the right. Figure 2 reveals that groups that contribute above 43.75% on average in the first round maintain rates of at least 61.25% by the final period.

To demonstrate the decisive role of initial contributions, we use an agent-based modeling approach. We first take the first-period contributions data from the P-experiment. Then we impose simple state-dependent linear decision rules based on patterns observed in the data. These rules determine how agents update contributions over time. We then run Monte Carlo simulations for each of the 15 groups and examine the evolution of contribution over periods. The details of our procedure are discussed in the Appendix and the timeline of the agent-based modeling algorithm is summarized in Fig. 3.
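The procedure described above can be sketched as a small simulation. The code below is an illustration only: the actual decision rules are the estimated ones reported in the Appendix tables, which are not reproduced here, so every coefficient (`P_KEEP`, `PULL_TO_OTHERS`, `PUNISH_RESPONSE`, `PUNISH_RATE`, `NOISE_SD`) is a hypothetical placeholder, and the functional forms are simplifying assumptions. Contributions are also treated as continuous rather than integer tokens.

```python
import random
import statistics

ENDOWMENT = 20
N_PERIODS = 10

# Hypothetical placeholder coefficients, NOT the estimates from the paper.
P_KEEP = 0.4           # probability of repeating last period's contribution
PULL_TO_OTHERS = 0.5   # adjustment towards the other members' average
PUNISH_RESPONSE = 0.3  # tokens added per punishment point received
PUNISH_RATE = 0.2      # points assigned per token of the target's shortfall
NOISE_SD = 1.0         # standard deviation of the idiosyncratic shock


def assign_punishment(contribs):
    """Points received by each member; i punishes j when j contributes less."""
    n = len(contribs)
    received = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j and contribs[j] < contribs[i]:
                received[j] += PUNISH_RATE * (contribs[i] - contribs[j])
    return received


def next_contributions(contribs, rng):
    """One period of state-dependent linear updating for a whole group."""
    received = assign_punishment(contribs)
    n = len(contribs)
    updated = []
    for i, c in enumerate(contribs):
        if rng.random() < P_KEEP:  # keep last period's contribution
            updated.append(c)
            continue
        others_avg = (sum(contribs) - c) / (n - 1)
        target = (c + PULL_TO_OTHERS * (others_avg - c)
                  + PUNISH_RESPONSE * received[i]
                  + rng.gauss(0, NOISE_SD))
        updated.append(min(ENDOWMENT, max(0.0, target)))
    return updated


def simulate_group(first_period, rng):
    """Average group contribution per period, starting from observed data."""
    path = [list(first_period)]
    for _ in range(N_PERIODS - 1):
        path.append(next_contributions(path[-1], rng))
    return [statistics.mean(p) for p in path]


def monte_carlo(first_period, n_runs=500, seed=0):
    """Per-period mean and 2-standard-deviation band across simulated runs."""
    rng = random.Random(seed)
    runs = [simulate_group(first_period, rng) for _ in range(n_runs)]
    means, bands = [], []
    for t in range(N_PERIODS):
        values = [run[t] for run in runs]
        m = statistics.mean(values)
        sd = statistics.pstdev(values)
        means.append(m)
        bands.append((m - 2 * sd, m + 2 * sd))
    return means, bands
```

Calling `monte_carlo` once per group with that group's first-period contributions yields a simulated path and band that could be compared against the observed group averages. Only the first period is taken from data; all later periods are generated by the rules.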

Fig. 3 Timeline of the model. Notes: Fig. 3 displays the timeline of agent-based modeling for the P-experiment. Only first-period contributions are taken from the experiment, while the remaining data are generated through agent-based modeling. Subjects’ decisions to keep or change their contribution level in subsequent periods are determined by Table 3. If a subject changes their contribution level, their contribution in the next period is generated using Table 4. Sanctioning decisions are made using Table 6

Fig. 4 Agent-based model simulation results. Notes: The horizontal axis shows the periods, and the vertical axis shows the average group contribution. The solid line represents the group averages from the P-experiment data. The dotted line represents the group-specific average agent-based model results from the Monte Carlo simulations, with the shaded gray areas representing the resulting 2-standard deviation confidence intervals

Figure 4 displays the results of the simulations. For each group, it shows the average contributions from the simulations (dotted lines) with 2-standard deviation confidence intervals (shaded areas). These are plotted alongside the actual average contributions from the experiment (solid lines). Despite its simplicity, the imposed decision rule mimics not only the actual final contributions but also the contribution dynamics for most groups.

5 Comparison with Herrmann et al. (2008)

The heterogeneity observed in the initial average contributions and its persistence throughout the game are not unique to our experiment. As such, the detrimental effects of a within-subject design may extend beyond Istanbul.

To compare our findings to those of Herrmann et al. (2008) for Istanbul, we plot first-period average group contributions along with their last-period counterparts in Fig. 5. This illustration shows that both datasets exhibit a high degree of variability in the first-period average group contributions.

However, there are two stark differences. First, only 2 out of 15 groups in our experiment contributed at a rate lower than 25% in the first period, compared to 7 out of 16 groups in the experiment of Herrmann et al. (2008). Second, while the slope between the first- and last-period average group contributions in the data of Herrmann et al. (2008) is close to unity (0.997, with a standard error of 0.243), it is considerably steeper in our experiment (1.352, with a standard error of 0.186). This suggests that while group contributions in Herrmann et al. (2008) stagnated on average, groups in our experiment on average raised their contributions over the periods.
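A slope and standard error of this kind can be obtained from a simple least-squares fit of last-period on first-period group averages. The sketch below assumes a single regressor with an intercept and classical (homoskedastic) standard errors; whether the reported estimates use exactly this specification is not stated in the text, and the function name `ols_slope` is introduced here for illustration.

```python
import math


def ols_slope(x, y):
    """Slope, intercept, and slope standard error for y = a + b * x."""
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, se
```

Here `x` would hold each group's first-period average contribution and `y` its last-period average. A slope near one means that differences between groups in the first period carry over roughly one-for-one to the last period.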

Given that our experiment was conducted at the same university with the same subject pool, we argue that these differences are due to our between-subject design versus Herrmann et al. (2008)’s within-subject design, in which all average group contributions converged to zero by the last period of the no-punishment treatment.

Fig. 5 Comparison of average contributions in the P-experiment. Notes: Each point in the figure represents the average contribution of a group in the P-experiment. The horizontal axis shows the group’s average contribution in the first period, and the vertical axis shows the group’s average contribution in the tenth period. Groups below the dashed 45-degree line did not improve their average contribution from the first period, while groups above the line did improve their average contribution. The solid line represents the linear best-fit line, and the shaded gray areas show the 95% confidence intervals

Istanbul was not the only city that demonstrated significant heterogeneity in average first-period contributions that persisted throughout the P-experiment. The experiments by Herrmann et al. (2008) were conducted in various cities with marked socio-economic differences. To explore which of these cities resemble Istanbul the most, we conduct a principal component analysis (PCA) using three variables: (i) straight-line physical distance to Istanbul, (ii) GDP per capita (in 2017 current US dollars), and (iii) cultural and psychological distance to Istanbul (Turkey) via Muthukrishna (2018)’s WEIRD scale index. Our PCA reveals that a cluster of six cities from Herrmann et al. (2008) resembles Istanbul the most: Athens (Greece), Dnipropetrovsk (Ukraine), Samara (Russia), Minsk (Belarus), Riyadh (Saudi Arabia), and Muscat (Oman).
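To make the PCA step concrete, the following sketch extracts the first principal component from a small table with one row per city and one column per variable. It is an illustration under stated assumptions: the variables are standardized first, the component is found by power iteration on the correlation matrix, and the function names are hypothetical. The exact PCA specification and the way the cluster was identified from it are not given in the text, and no city data are reproduced here.

```python
import math


def standardize(rows):
    """Rescale each column to zero mean and unit (population) variance.

    Assumes every column has nonzero variance.
    """
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def first_component(rows, n_iter=200):
    """Loadings, explained-variance share, and scores of the first component."""
    z = standardize(rows)
    n, k = len(z), len(z[0])
    corr = [[sum(r[a] * r[b] for r in z) / n for b in range(k)]
            for a in range(k)]
    # Power iteration; the asymmetric start vector makes it unlikely
    # (though not impossible) to begin orthogonal to the dominant direction.
    vec = [1.0 / (a + 1) for a in range(k)]
    for _ in range(n_iter):
        nxt = [sum(corr[a][b] * vec[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[a] * sum(corr[a][b] * vec[b] for b in range(k))
                     for a in range(k))
    scores = [sum(r[a] * vec[a] for a in range(k)) for r in z]
    # The trace of a correlation matrix is k, so eigenvalue / k is the share.
    return vec, eigenvalue / k, scores
```

Cities whose scores lie close to Istanbul's score on the leading components would then be candidates for the group of similar cities.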

Figure 6 shows the first- and last-period average group contributions for the six cities from Herrmann et al. (2008) that most resemble Istanbul. All of these cities exhibit significant first- and last-period heterogeneity in average group contributions, with a strong positive relationship between the two whose slope is close to unity. Additionally, a city-fixed-effect regression of last-period average group contributions on first-period average group contributions for these six cities yields a slope coefficient of 0.975 (with standard error 0.140), which is not statistically different from the slope of 0.997 (with standard error 0.243) that we estimate for Istanbul. Therefore, we conclude that Istanbul is not idiosyncratic in the heterogeneity of its first-period average contributions or in the persistence of that heterogeneity throughout the game. Instead, these patterns are statistically common across cities that share socioeconomic similarities with Istanbul. As such, the increase in Istanbul’s contribution rates due to the between-subject design could plausibly extend to other cities with similar characteristics.
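As a rough back-of-the-envelope check of the slope comparison, one can compute a z statistic from the two reported estimates and their standard errors. This sketch assumes the two estimates are independent and approximately normal; it is not necessarily the test used in the paper, and the function names are introduced here for illustration.

```python
import math


def slope_difference_z(b1, se1, b2, se2):
    """z statistic for H0: b1 == b2, treating the estimates as independent."""
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# Reported estimates: six similar cities (0.975, SE 0.140) vs Istanbul
# in Herrmann et al. (2008) (0.997, SE 0.243).
z = slope_difference_z(0.975, 0.140, 0.997, 0.243)
p_value = two_sided_p(z)
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

Under these assumptions the z statistic is close to zero, far inside conventional critical values, which is consistent with the statement that the two slopes are not statistically different.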

Fig. 6 First-period and last-period contributions by city (Herrmann et al., 2008). Notes: Each point in the figure represents the average contribution of a group in the P-experiment. The horizontal axis shows the group’s average contribution in the first period, and the vertical axis shows the group’s average contribution in the tenth period. Groups below the dashed 45-degree line did not improve their average contribution from the first period, while groups above the line did improve their average contribution. The solid line represents the linear best-fit line, and the shaded gray areas show the 95% confidence intervals

6 Discussion and concluding remarks

In this paper, we show that contributions in a public goods game with punishment are significantly higher in Istanbul under a between-subject design than under a within-subject design in which the no-punishment condition precedes the punishment condition. This highlights the detrimental effect of prior experience without punishment on cooperation. However, Istanbul’s average contribution remains below Western city levels, suggesting that limited cooperation persists even with sanctions.

We identify two key factors behind cooperation patterns: heterogeneous initial contributions, extending Burlando and Guala (2005), and simple contribution updating rules based on prior contributions and sanction points estimated from the data. An agent-based model verifies that the interaction of these factors generates strong persistence in contribution levels over time.

The data analysis from Herrmann et al. (2008) reveals similar heterogeneity and persistence in contribution levels over time in cities that resemble Istanbul. Therefore, our results are likely generalizable to these settings if a between-subject design is employed, which would eliminate the cooperation decay induced by a prior no-punishment condition.

Our results highlight the crucial role of initial contributions in shaping subsequent cooperation within a group, which is consistent with studies that have shown the effectiveness of grouping subjects based on contribution levels. For example, Gunnthorsdottir et al. (2007) find that contribution decay is lower when subjects are matched based on prior actions rather than randomly matched. Other relevant studies include Gächter and Thöni (2005), Ones and Putterman (2007), and Gunnthorsdottir et al. (2010). Brekke et al. (2011) also allow for endogenous group formation based on charity donations and find that cooperation is improved among donors.

Similar to Gächter et al. (2010), we posit that culture affects cooperation through beliefs and punishment responses. The variance we observe in initial contributions underscores the power of beliefs, which is consistent with the reduced contributions observed following a within-subject design as in Herrmann et al. (2008). Personal risk preferences also influence first-round contributions, which are then adjusted based on received sanctions. A group that starts with low contributions and antisocial punishment risks cooperation failure.

Our study makes several contributions. It demonstrates the critical impact of experimental design and initial conditions on cooperation outcomes. It provides guidance for robust cross-society experiment design by underscoring within-group heterogeneity. Our findings uniquely isolate the effect of first-round divergence, complementing research on culture and conditional cooperation in social dilemmas. The insights into contribution updating rules and belief formation advance theoretical understanding of how cooperation evolves.

Future work should further explore the sources of heterogeneous initial contributions and beliefs across individuals. Overall, highlighting within-society variation rather than just cross-society differences is critical for drawing valid inferences and crafting policies to encourage cooperation. This study demonstrates the inadequacy of only considering city-level aggregates when cooperation hinges fundamentally on initial beliefs within subgroups.