
1 Introduction

Mixed-motive games comprise a subset of games in which individual and collective incentives are not entirely aligned. These games describe situations in which the combined effects of every individual’s selfishness do not yield a good outcome for the group, a problem also known as the collective action problem [24]. Two basic properties define this type of game [8]: a) every individual is incentivized to socially defect and b) all individuals are better off if all cooperate than if all defect.

Olson develops the notion of a collective action problem starting from the raison d’être of organizations [24]. These, as he describes, are groups that serve to further the interests of their members. The problem emerges when the individuals of such groups also have incentives antagonistic to those common to the collective. Individuals, in this case, are left to choose between harming the organization as a whole in favor of their own benefit, or passing on the opportunity for bigger gains in favor of the group. A collective action problem happens when the former is systematically preferred over the latter.

Global warming is a real-world case of the collective action problem. In it, most players—be it an individual, institution, or government—have an incentive to emit as much greenhouse gas as desired—for reasons of comfort, financial gain, or popularity—regardless of how much others are emitting. If collective emissions surpass some threshold, the system increasingly dips into an undesirable state that is bad for all involved.

It has been noted that real-world communities are capable of circumventing this problem with varying success, conditioned on variables such as group size, the existence of a communication channel, etc. [25, 26]. These factors are tied to, and serve to strengthen, the idea of social norms: a guide of conduct, or the expectation individuals hold of others in certain situations [22].

Social norms and norm enforcement mechanisms can be a useful tool in guiding groups of people out of social dilemmas [17], but they can also be incorporated into multiagent systems (MAS) [5, 6]. This institutional machinery provides ways of governing mixed-motive games either via centralized solutions—when a central governing body is tasked with running the institutional apparatus by itself—or decentralized solutions—when the normative system is conducted by the agents in the system.

Decentralized norm-enforcement approaches have been used to deal with degrading system properties in MASs [9, 15], such as the collective action problem. However, these decentralized solutions either assume a) pro-social behavior from the agents or b) some form of direct or indirect retaliatory capacity—e.g. having the choice not to cooperate in future interactions—that is at least similar in intensity to the harm caused by the aggressor. We acknowledge the effectiveness of these solutions in some cases but also recognize they are no panacea.

For instance, how can one—agent or group of agents—successfully drive a complex MAS towards social order [5] from within, without assuming anything about others’ beliefs, intentions, or goals, and given that punishing non-compliant behavior is not desirable or allowed? This problem is akin to many situations in modern society: it is impossible to know the beliefs and intentions of every person we might interact with, and not every problem we face is ideally solved by a “taking matters into one’s own hands” approach.

Consider as an example the problem of burglary. We—as a society—don’t expect social norms and good moral values to completely solve the problem—although they certainly affect the rate at which it happens—and when a burglary does happen, we don’t expect the victim to return the favor with a response of similar intensity, like stealing from the aggressor’s house.

A similar issue may also occur in MASs. Consider a system of self-driving autonomous vehicles. Every vehicle in it might have an incentive to get to its destination as fast as possible. Suppose that, to this end, a vehicle engages in careless maneuvers and risky overtakes to gain a few extra seconds, harming others—safety and/or performance—close to it in the process. Could we safely assume agents in this system are pro-social to the degree that such a situation would never happen?

This might not always be a good premise. In this example, the system itself is embedded in a competitive environment of firms fiercely fighting for market share. Performance, in the form of getting to the final destination faster, might represent getting a bigger slice of the pie. Does the designer behind the agent have the right incentives to design pro-social agents? Social defection for the sake of financial gains is by no means unthinkable in the automobile industry.

Now, suppose that non-compliant behavior has been identified by another vehicle close by. Could any form of punishment by the latter be accomplished without compromising the safety of passengers riding in both vehicles? Furthermore, even if we agree that reciprocating is safe, there are many situations where direct retaliation might be undesirable. For instance, how do we address fairness in these systems? If the system is highly interconnected, even a small violation could be met with a huge wave of public bashing, similar to the problem of internet cancel culture.

If it is not safe to assume other agents will cooperate, and it is not desirable for agents to directly or indirectly punish each other, we may need to resort to centralized governance of some kind. Jones and Sergot (1994) propose two complementary models of centralized norm enforcement [16]:

  1. Regimentation: Assumes agents can be controlled by some external entity; therefore, non-compliant behavior does not occur.

  2. Regulation: Assumes agents can violate norms, and violations may be sanctioned when detected.

A drawback of the former is that it constrains agents’ autonomy [22]. Furthermore, implementing a regimentation system is not necessarily trivial; edge cases may arise such that violations may still occur [16]. On the other hand, the latter preserves—to some degree—agents’ autonomy by allowing their actions to violate norms.

This work proposes a way out of the collective action problem in mixed-motive multiagent reinforcement learning (MARL) environments through centralized regulation. The proposal involves enhancing regular mixed-motive environments with a normative system, controlled by a reinforcement learning (RL) agent playing the role of a regulator, which is able to set the system’s norms and sanctions according to the ADICO grammar of institutions [7]. The primary aim of this proposal is to solve the collective action problem in mixed-motive MARL environments given two assumptions:

  1. We have no prior knowledge about the agents’ architectures; thus, it is impossible to predict their incentives and behaviors.

  2. It is not desirable for agents in the system to punish each other.

We also show that, by employing this method, social control can be achieved using only off-the-shelf, traditional RL agent architectures.

2 Related Work

Many studies have addressed the collective action problem in mixed-motive MARL environments [9, 15, 18, 20, 27]. Still, most of them have tackled this problem from an agent-centric perspective; their solutions involve adapting an RL architecture to the specific needs of multiagent mixed-motive environments. This has been accomplished in different ways, such as allowing agents to have pro-social intrinsic motivation [15, 20, 27], coupling agents with a reciprocity mechanism [9, 18], and deploying agents with a normative reasoning engine [23].

This very same problem—and others—has also been addressed in MASs through the adoption of electronic institutions (EI) [10, 11], which specify, among other definitions, a set of rules that determines what the agents in the system ought or ought not to do under predefined circumstances, similar to the role traditional institutions play [1]. Likewise, the autonomic electronic institution (AEI) is also a framework that can be used to govern MASs and may be better suited to cope with the dynamism of complex systems of self-adapting agents due to its autonomic capabilities (norm-setting at run-time) [1, 2].

The work presented here is similar to the AEI framework in the sense that it also proposes to overcome a system-level problem by dynamically regulating the system’s norms at run-time. Still, it differs from that framework by combining, in a single agent, the learning capabilities of RL with the normative concepts spread across a broad literature. Our work also broadly resembles the AI Economist framework proposed by Zheng et al. [33], which allows for the training of RL social planners that learn optimal tax policies in a multiagent environment of adaptable economic actors by observing and optimizing macro-properties of the system (productivity and equality).

In summary, to the best of our knowledge, none of the studies cited above have: a) proposed a centralized norm enforcement solution to mixed-motive MARL environments using another RL agent as a central governing authority, and b) proposed a solution that uses only traditional RL architectures when peer retaliation is not allowed.

3 Normative Systems and the ADICO Grammar of Institutions

One way of preventing MASs from falling into social disorder [5] is to augment the system with a normative qualifier. Thus, a normative system can be simply defined as one in which norms and normative concepts interfere with its outcomes [22]. In these settings, despite there being no unified definition, a norm can generally be described as a behavioral expectation that the majority of individuals in a group hold of others in the same group in certain situations [31].

In normative systems, norms that are not complied with might be subject to sanction. Sanctions can be generally classified into direct material sanctions, which have an immediate negative effect on a resource the agent cherishes, such as a fine, and indirect social sanctions, such as a lowering effect on the agent’s reputation, which can influence its future within the system [4]. Nardin [22] also describes a third type of sanction: psychological sanctions are those inflicted by an agent on itself as a function of the agent’s internal emotional state.

The ADICO grammar of institutions [7] provides a framework under which norms can be conceived and operationalized. The ADICO grammar is defined within five dimensions:

  • Attributes: the set of variables that defines to whom the institutional statement applies.

  • Deontic: a holder for one of the three modal operators from deontic logic: may (permitted), must (obliged), and must not (forbidden). These are used to distinguish prescriptive from non-prescriptive statements.

  • Aim: describes a particular action or set of actions to which the deontic operator is assigned.

  • Conditions: defines the context—when, where, how, etc.—in which an action is obliged, permitted, or forbidden.

  • Or else: defines the sanctions imposed for not following the norm.

Example 1

The norm “All Brazilian citizens, 18 years of age or older, must vote for a presidential candidate every four years, or else he/she will be unable to renew his/her passport”, expressed in the ADICO grammar, can be broken down into: A: Brazilian citizens, 18 years of age or older; D: must; I: vote for a presidential candidate; C: every four years; O: will be unable to renew his/her passport.
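To make the grammar concrete, the following is a minimal Python sketch—our own illustration, not part of the ADICO formalism—of how an institutional statement, and Example 1 in particular, could be represented as a plain data structure; the class and field names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Deontic(Enum):
    MAY = "permitted"
    MUST = "obliged"
    MUST_NOT = "forbidden"


@dataclass
class AdicoStatement:
    """One institutional statement in the ADICO grammar."""
    attributes: str   # to whom the statement applies
    deontic: Deontic  # may / must / must not
    aim: str          # the action the deontic operator targets
    conditions: str   # when / where / how the statement holds
    or_else: str      # sanction for non-compliance


# Example 1 encoded as an ADICO statement
voting_norm = AdicoStatement(
    attributes="Brazilian citizens, 18 years of age or older",
    deontic=Deontic.MUST,
    aim="vote for a presidential candidate",
    conditions="every four years",
    or_else="will be unable to renew his/her passport",
)
```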

4 Reinforcement Learning (RL)

4.1 Single-Agent Reinforcement Learning

The reinforcement learning task mathematically formalizes the path of an agent interacting with an environment, receiving feedback—positive or negative—for its actions, and learning from them. This formalization is accomplished through the Markov decision process (MDP), defined by the tuple \(\langle \mathcal {S}, \mathcal {A}, \mathcal {R}, \mathcal {P}, \gamma \rangle \) where \(\mathcal {S}\) denotes a finite set of environment states; \(\mathcal {A}\), a finite set of agent actions; \(\mathcal {R}\), a reward function \(\mathcal {R}: \, \mathcal {S} \times \mathcal {A} \times \mathcal {S} \, \rightarrow \, \mathbb {R}\) that defines the immediate—possibly stochastic—reward an agent gets for taking action \(a \in \mathcal {A}\) in state \(s \in \mathcal {S}\), and transitioning to state \(s' \in \mathcal {S}\) thereafter; \(\mathcal {P}\), a transition function \(\mathcal {P}: \, \mathcal {S} \times \mathcal {A} \times \mathcal {S} \, \rightarrow \, [0,1]\) that defines the probability of transitioning to state \(s' \in \mathcal {S}\) after taking action \(a \in \mathcal {A}\) in state \(s \in \mathcal {S}\); and finally, \(\gamma \in [0,1]\), a discount factor of future rewards [29].

In these settings, the agent’s goal is to maximize its long-term expected reward \(G_t\), given by the discounted sum \(G_t = \mathbb {E} [r_{t+1} + \gamma r_{t+2} + \gamma ^2 r_{t+3} + \dots ] = \mathbb {E} \left[ \sum _{k=0}^{\infty } \gamma ^{k} r_{t+k+1} \right] \). Solving an MDP ideally means finding an optimal policy \(\pi _*: \, \mathcal {S} \, \rightarrow \, \mathcal {A}\), i.e., a mapping that yields the best action to be taken at each state [29].
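As a small numerical illustration of the return defined above, the sketch below computes the discounted sum for a finite trajectory of observed rewards, truncating the infinite sum at the trajectory length; the values are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Truncated discounted return: sum over k of gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Illustrative rewards received after time t, with gamma = 0.9:
print(discounted_return([1.0, 0.0, 2.0], 0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 ≈ 2.62
```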

4.2 Multi-Agent Reinforcement Learning (MARL)

One critical difference between RL and MARL is that, instead of the environment transitioning to a new state as a function of a single action, it does so as a function of the combined efforts of all agents.

The MDP counterpart in MARL is the Markov Game (MG) [19], also known as a Stochastic Game, and it is defined by a tuple \(\langle \mathcal {N}, \mathcal {S}, \{ \mathcal {A}^i \}_{i \in \mathcal {N}}, \{ \mathcal {R}^i \}_{i \in \mathcal {N}}, \mathcal {P}, \gamma \rangle \), where \(\mathcal {N} = \{ 1, ..., N \}\) denotes the set of \(N > 1\) agents; \(\mathcal {S}\), a finite set of environment states; and \(\mathcal {A}^i\), agent i’s set of possible actions. Let \(\mathcal {A} = \mathcal {A}^1 \times ... \times \mathcal {A}^N \) be the set of agents’ possible joint actions. Then \(\mathcal {R}^i\) denotes agent i’s reward function \(\mathcal {R}^i: \, \mathcal {S} \times \mathcal {A} \times \mathcal {S} \, \rightarrow \, \mathbb {R}\) that defines the immediate reward earned by agent i given a transition from state \(s \in \mathcal {S}\) to state \(s' \in \mathcal {S}\) after a combination of actions \(a \in \mathcal {A}\); \(\mathcal {P}\), a transition function \(\mathcal {P}: \, \mathcal {S} \times \mathcal {A} \times \mathcal {S} \, \rightarrow \, [0,1]\) that defines the probability of transitioning from state \(s \in \mathcal {S}\) to state \(s' \in \mathcal {S}\) after a combination of actions \(a \in \mathcal {A}\); and \(\gamma \in [0,1]\), a discount factor on agents’ future rewards [32].
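To illustrate the key difference from the single-agent case—that transitions and rewards depend on the joint action—here is a toy, one-state Markov Game (a repeated two-player prisoner’s dilemma); the payoff values are our own illustrative choice, not taken from this work.

```python
from typing import Dict, Tuple

# Toy one-state Markov Game: a repeated two-player prisoner's dilemma.
# Actions: 0 = cooperate, 1 = defect. Rewards depend on the *joint* action.
PAYOFFS = {
    (0, 0): (3.0, 3.0),  # mutual cooperation
    (0, 1): (0.0, 5.0),  # player 0 exploited
    (1, 0): (5.0, 0.0),  # player 1 exploited
    (1, 1): (1.0, 1.0),  # mutual defection
}


def step(joint_action: Dict[int, int]) -> Tuple[int, Dict[int, float]]:
    """One transition: returns the (only) next state and per-agent rewards."""
    r0, r1 = PAYOFFS[(joint_action[0], joint_action[1])]
    return 0, {0: r0, 1: r1}


print(step({0: 0, 1: 1}))  # (0, {0: 0.0, 1: 5.0})
```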

5 Centralized Norm Enforcement in MARL

Here, we propose a norm-enhanced Markov Game (neMG) for governing mixed-motive MGs by making use of an RL regulator agent and some added normative concepts. The proposal builds upon regular mixed-motive MGs. It involves enhancing the environment’s states with the ADICO information introduced in Sect. 3. The regulator is then able to operate on this ADICO information, which is also available to the other agents in the game and can be considered in their decision-making.

The method comprises two types of RL agents: \(N > 1\) players and one regulator. Players are simple RL agents, analogous to the ones that interact with regular versions of MARL environments. These agents could be modeled as average self-interested RL agents with off-the-shelf architectures such as A2C [21]—which facilitates the engineering side.

The regulator, in turn, is able to operate on the environment’s norms represented by the ADICO five dimensions; it can modify one or more dimensions at every period—a period consists of m time steps, m being a predefined integer value. This agent senses the state of the environment through a social metric—i.e. a system-level diagnostic—and the efficacy of its actions is signaled back by the environment based on the social outcome of past institutions. The regulator can also be modeled as a self-interested agent with off-the-shelf RL architectures.

Definition 1

A norm-enhanced Markov Game (neMG) can be formally defined by an 11-tuple \(\langle \mathcal {N}_p, \mathcal {S}_p, \{ \mathcal {A}_p^i \}_{i \in \mathcal {N}_p}, \{ \mathcal {R}_p^i \}_{i \in \mathcal {N}_p}, \mathcal {P}_p, \gamma _p, \mathcal {S}_r, \mathcal {A}_r, \mathcal {R}_r, \mathcal {P}_r, \gamma _r \rangle \), with \(\mathcal {N}_p, \mathcal {S}_p, \mathcal {A}_p^i, \mathcal {R}_p^i, \mathcal {P}_p, \gamma _p\) being the players’ original MG as defined in Sect. 4.2. \(\mathcal {S}_r\) denotes the regulator’s set of states; \(\mathcal {A}_r\), the regulator’s set of actions; \(\mathcal {R}_r\), the regulator’s reward function \(\mathcal {R}_r: \, \mathcal {S}_r \times \mathcal {A}_r \times \mathcal {S}_r \, \rightarrow \, \mathbb {R}\) that determines the immediate reward earned by the regulator following a transition from state \(s_r \in \mathcal {S}_r\) to \(s'_r \in \mathcal {S}_r\) after an action \(a_r \in \mathcal {A}_r\); \(\mathcal {P}_r\), the regulator’s transition function \(\mathcal {P}_r: \, \mathcal {S}_r \times \mathcal {A}_r \times \mathcal {S}_r \, \rightarrow \, [0,1]\) that defines the environment’s probability of transitioning from state \(s_r \in \mathcal {S}_r\) to state \(s'_r \in \mathcal {S}_r\) after an action \(a_r \in \mathcal {A}_r\); and \(\gamma _r \in [0,1]\), the regulator’s discount factor.

In these settings, a neMG could be run following two RL loops: an outer one relative to the regulator, and an inner one relative to the players. Algorithm 1 exemplifies how these could be implemented; a sketch of the two loops is also given below.

Algorithm 1. Running a norm-enhanced Markov Game with an outer (regulator) loop and an inner (players) loop.
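Since Algorithm 1 is not reproduced here, the following Python-style sketch is our own rendering of the two loops under stated assumptions: the env, regulator, and players objects and their methods are hypothetical, and the exact update rules depend on the chosen RL architectures.

```python
def run_nemg_episode(env, regulator, players, steps_per_period, num_periods):
    """Hypothetical sketch: outer loop = regulator RL, inner loop = players' RL."""
    s_r = env.regulator_reset()                   # regulator's initial observation
    done = False
    for _ in range(num_periods):
        a_r = regulator.act(s_r)                  # set the institution (e.g. the C and O dimensions)
        env.apply_institution(a_r)
        for _ in range(steps_per_period):         # players interact under this institution
            obs = env.player_observations()
            joint_action = {i: players[i].act(obs[i]) for i in players}
            rewards, done = env.step(joint_action)    # sanctions applied inside step()
            for i in players:
                players[i].learn(obs[i], joint_action[i], rewards[i])
            if done:
                break
        s_r_next, r_r = env.regulator_feedback()  # social metric + reward for past norms
        regulator.learn(s_r, a_r, r_r, s_r_next)
        s_r = s_r_next
        if done:
            break
```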

6 Tragedy of the Commons Experiment

The method was tested on a mixed-motive environment that emulates the tragedy of the commons problem described by Hardin (1968) [14]. The tragedy of the commons describes a situation wherein a group of people shares a common resource that replenishes at a given rate. Every person has an individual interest in consuming as much of the resource as possible, but if the consumption rate consistently exceeds the replenishment rate, the commons is soon depleted.

6.1 A neMG of a Tragedy of the Commons Environment

The environment closely resembles that of Ghorbani et al. (2021) [12] and was built using both the OpenAI gym [3] and pettingzoo [30] frameworks. An episode begins with an initial quantity \(R_0\) of the common resource. Every n simulation steps—n being the number of agents; five for this simulation—the resource grows by a quantity given by the logistic function \(\varDelta R = rR(1 - \frac{R}{K})\), with \(\varDelta R\) being the amount to increase; r, the growth rate; R, the current resource quantity; and K, the environment’s carrying capacity—an upper bound on resources. For this experiment, r was set to 0.3, \(R_0\) is sampled from a uniform distribution U(10000, 30000), and K was set to 50000.
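As a check on these numbers, here is a minimal sketch of the replenishment rule (the function and variable names are ours); with r = 0.3 and K = 50000, growth peaks at R = K/2 = 25000, where \(\varDelta R = 3750\).

```python
import random

GROWTH_RATE = 0.3           # r
CARRYING_CAPACITY = 50_000  # K


def replenish(resource: float, r: float = GROWTH_RATE,
              k: float = CARRYING_CAPACITY) -> float:
    """Apply one logistic growth step: R + r * R * (1 - R / K)."""
    return resource + r * resource * (1 - resource / k)


R = random.uniform(10_000, 30_000)  # R_0 ~ U(10000, 30000)
print(replenish(R))                 # resource after one growth step
print(replenish(25_000))            # 25000 + 3750 = 28750, the maximum growth point
```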

The environment also encodes the ADICO variables as described in Sect. 5. The A, D, and I dimensions remain fixed for this experiment since a) the norm applies to all players, b) the norm always defines a forbidden action, and c) players have only one action to choose from—they can only decide how much of the resource to consume and their rewards are proportional to their consumption. The C and O dimensions, on the other hand, may be changed by the regulator agent; i.e., every 100 steps the regulator may change how much of the resource a player is allowed to consume (l)—sampled at the beginning of each episode from a normal distribution N(375, 93.75)—and the fine applied to those who violate this condition (\(f(c,l,\lambda )\))—by setting the value of \(\lambda \), which is sampled at the beginning of each episode from a normal distribution N(1, 0.2). Thus the ADICO information that enhances this environment is made up of:

  • A: all players;

  • D: forbidden;

  • I: consume resources;

  • C: when consumption is greater than \(l_i\);

  • O: pay a fine of \(f = (c_i - l_i) \times (\lambda + 1)\), with \(c_i\) being the agent’s consumption in step i; \(l_i\) the consumption limit in step i; and \(\lambda \), a fine multiplier.

The fine is subtracted from the violator’s consumption in the same step the norm is violated.
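A minimal sketch of the sanction just described follows; the explicit zero-fine branch for compliant consumption is our assumption of the intended behavior.

```python
def fine(consumption: float, limit: float, lam: float) -> float:
    """Fine for violating the consumption norm: (c_i - l_i) * (lambda + 1)."""
    if consumption <= limit:
        return 0.0                      # assumed: no fine when within the limit
    return (consumption - limit) * (lam + 1)


# Example: consuming 500 with a limit of 375 and lambda = 1.0
payoff = 500 - fine(500, 375, 1.0)      # fine = 125 * 2 = 250, so 250 remains
print(payoff)
```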

Before a new institution is set, the regulator can evaluate the system-level state of the environment by observing how much of the resource is left, together with a short-term and a long-term sustainability measurement, given by \(S = \sum _{j=t-p}^{t} \frac{rp_j}{c_j}\), defined for \(c_j > 0\) and \(p \ge 0\), with p being the number of periods considered as short-term and long-term—respectively one and four for this simulation; \(rp_j\), the total amount of resources replenished in period j; \(c_j\), the total consumption in period j; and t, the current period. At the end of the period, the success of past norms is fed back to the regulator by the environment as a reward value directly proportional to the last period’s total consumption.
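The sketch below shows how the sustainability measurement could be computed over the last p + 1 periods; the per-period history lists and the skipping of zero-consumption periods are our assumptions.

```python
def sustainability(replenished, consumed, p):
    """S = sum over periods j = t - p .. t of replenished_j / consumed_j (c_j > 0)."""
    window = zip(replenished[-(p + 1):], consumed[-(p + 1):])
    return sum(rp / c for rp, c in window if c > 0)


# Hypothetical per-period histories (most recent period last):
rp_history = [3000.0, 3200.0, 3500.0, 3600.0, 3400.0]
c_history = [2800.0, 3300.0, 3700.0, 3500.0, 3900.0]

short_term = sustainability(rp_history, c_history, p=1)  # last two periods
long_term = sustainability(rp_history, c_history, p=4)   # last five periods
```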

At every simulation step, players in the environment can observe \(R_i\), \(l_i\), and \(\lambda _i\), and can choose how much of the resource to consume. An agent’s consumption may vary from 0 to \(c_{max}\), where \(c_{max}\) is a consumption limit that represents a physical limit in an analogous real-world scenario. Here, this value was set to 1500. An episode ends after 1000 simulation steps or when resources are depleted.

Agents in this simulation were built using traditional RL architectures—SAC [13] for the regulator and A2C [21] for the players—using the Stable Baselines 3 framework [28], and players were trained on a shared policy. The learning rates for all agents were set to 0.00039. A summary of all environment-related variables used in this experiment and their values is presented in Table 1.

Table 1. Summary of the variables used in the experiment, their abbreviations, and values.

6.2 Results and Discussion

Figure 1 shows the average total consumption per episode over 10 simulation runs, with and without the regulator agent acting on the environment. As predicted by the Nash equilibrium, we notice there isn’t much hope for generalized cooperation when selfish agents are left playing the game by themselves; i.e., resources quickly deplete at the beginning of each episode.

Fig. 1. Average total consumption per episode over 10 simulation runs for the tragedy of the commons experiment. The green line shows the total consumption when the regulator is active and the blue line when it is inactive. The green shaded area covers the region one standard deviation above and below the mean for the simulation with the active regulator. (Color figure online)

Fig. 2. Time step consumption vs. consumption limit set by the regulator at an earlier episode (a) and at a later episode (b). The orange line shows the resource level at all time steps, and the dotted red line shows the resource level at which the replenishment rate is greatest (25000). In (a), players and the regulator act somewhat randomly; for this reason, resources are kept in a sustainable range but consumption is sub-optimal. In (b), players learn to approximate their consumption to the norm-set consumption limit, and the regulator learns to decrease that limit when resources are lower and increase it when they are higher. Resources in this episode are still kept in a sustainable range, and consumption increases sharply compared to (a). (Color figure online)

Conversely, this is not the case when the regulator is put in place. After a short period of randomness at the beginning of the simulation, players learn not to consume from the resource, since they frequently get punished when doing so. Around episode 300, players progressively learn to consume roughly as much of the resource as the set limit allows, and the regulator increasingly learns to adjust that limit so as to keep resources at a sustainable level. A comparison between an episode at the beginning of a simulation and one at the end is shown in Fig. 2.

Every once in a while, the regulator overshoots by setting too high a limit at the beginning of the episode, and players quickly deplete the resource. This explains in part the total consumption variation depicted in Fig. 1.

Note that the system gets relatively close to an upper consumption benchmark by the end of the simulation—the total that would be consumed if agents’ combined consumption equaled the maximum replenishment at every replenishment step. We can calculate this value by multiplying the maximum replenishment (3750) by the maximum count of replenishments in a given episode (200). In this case, the value is 750000 units of resource.
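For reference, that benchmark can be reproduced directly from the parameters in Sect. 6.1:

```python
max_replenishment = 0.3 * 25_000 * (1 - 25_000 / 50_000)  # peak logistic growth = 3750
replenishments_per_episode = 1000 // 5                    # one growth step every n = 5 steps
print(max_replenishment * replenishments_per_episode)     # 750000.0 units of resource
```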

7 Conclusion

Delegating norm enforcement to an external central authority might seem counter-intuitive at first, as we tend to associate distributed solutions with robustness. It also might seem to go against the findings of Elinor Ostrom [25, 26], who showed that the collective action problem can be solved without the need for a central regulatory authority, work for which she won the Nobel Prize in Economics in 2009.

That being said, central regulation is still an important mechanism for governing complex systems. Many of the world’s modern social and political systems use it in some shape or form. With this work, we try to show that central regulation is also a tool that can be useful in governing MASs and MARL systems, especially when it is not desirable for actors in the system to punish each other.

Still, centralized norm enforcement brings about many other challenges that are not present in decentralized norm enforcement. For instance, if poorly designed (purposely or not), the regulator itself, through the imposition of norms and sanctions, may drive the system to socially bad outcomes. What if the designer behind the regulator does not have the right incentives? Constraints such as these must be taken into consideration when judging the applicability of centralized norm enforcement in MASs.

As further work, we plan to test this very same method in other mixed-motive MARL environments.