
1 Introduction

Social norms are considered an effective mechanism to facilitate coordination and cooperation in Multi-Agent Systems (MASs). How social norms can establish themselves automatically through an emergent process is a key problem in this line of research. It is well recognized that learning from individual experience is a robust mechanism for facilitating the emergence of social norms [1]. A great deal of previous work has thus studied the emergence of social norms either through agent learning from random interactions in a well-mixed population of agents (e.g., [1, 2]), or through agent learning from local interactions with neighbors in networked agent societies (e.g., [3–7]).

Although these studies provide valuable insights into the general mechanisms behind norm emergence through agent learning, they have several limitations. First, rich organizational characteristics, such as the hierarchical structure of agent systems, are not considered. For self-regulated MASs, however, these organizational characteristics can have a significant impact on opinion evolution and information exchange among agents, and thereby substantially influence norm emergence in the entire system. It is therefore necessary to reflect these characteristics in the agents' local learning during norm emergence. Second, most current studies investigate norm emergence in small populations with small action spaces. When the action space and population become large, subnorms may persist in the system and traditional models cannot form norms efficiently. In real-life human societies, however, social norms emerge efficiently even in complicated organizations, ranging from enterprises to communities and nations, because some form of centralized control governs the distributed interactions among individuals. It is therefore necessary to integrate such centralized control into the distributed interactions among agents in order to facilitate norm emergence, especially as the system becomes more complex.

Based on the above considerations, this paper studies the influence of organizational structure on norm emergence by proposing a hierarchical learning framework for networked MASs. In the framework, agents are partitioned into clusters. Each agent in a cluster interacts with others using reinforcement learning and reports its learning information to the cluster's upper-level supervisor. After synthesizing all the interaction information in the cluster, the supervisor generates a guide policy by learning from neighboring supervisors and passes this policy down to its subordinate agents, which adapt their learning strategies heuristically. The main feature of the proposed framework is that, through hierarchical supervision between subordinate agents and supervisors, a compromise can be struck that elegantly balances distributed interactions and centralized control for norm emergence. Experiments show that learning from local interactions combined with hierarchical supervision is an effective mechanism for the emergence of social norms.

The remainder of the paper is organized as follows. Section 2 introduces social norms and reinforcement learning. Section 3 introduces the proposed learning framework. Section 4 presents experimental studies. Section 5 discusses related work. Finally, Sect. 6 concludes the paper with directions for future research.

2 Social Norms and Reinforcement Learning

In this study, we use the “rules of the road” scenario [8, 9] as a metaphor for studying norm emergence. Under this rule, each driver chooses to drive on the left side (L) or the right side (R) of the road, and the aim is to establish a norm in which one side (L or R) is chosen by all agents in the society. The rule can be modeled as the coordination game [9] in Table 1. There are two pure Nash equilibria, in which both agents choose left or both choose right, and the game expresses no preference for L over R.

Table 1. Payoff matrix of the coordination game

In real applications of MASs, the number of actions available to the agents is an important factor in the emergence of social norms. Most current studies of norm emergence, however, deal only with a relatively simple norm space in which a norm is chosen from two possible actions. In order to study norm emergence in a larger action space, we extend the coordination game into a more general form with \(N_a\) actions. At each time step, agents i and j choose actions \(a_i\) and \(a_j\), respectively. If \(a_i = a_j\), both receive an immediate payoff of 1, and \(-\)1 otherwise.
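To make the payoff of this generalized game concrete, here is a minimal Python sketch; the function name is ours, not from the paper.

```python
def coordination_payoff(a_i: int, a_j: int) -> int:
    """Pairwise payoff of the generalized coordination game:
    both agents receive 1 if their actions match, and -1 otherwise."""
    return 1 if a_i == a_j else -1

# Example with N_a = 4 actions labelled 0..3
assert coordination_payoff(2, 2) == 1
assert coordination_payoff(0, 3) == -1
```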

Reinforcement learning algorithms [12] have been widely used to model agent interactions in previous research on norm emergence. We focus on the Q-learning algorithm in this paper, in which an agent makes decisions based on a set of estimated Q-values. Its one-step update rule is given by Eq. 1,

$$\begin{aligned} Q(s,a) \leftarrow Q(s,a) + \alpha_i \left[ r(s,a) + \lambda \max_{a'} Q(s',a') - Q(s,a) \right] \end{aligned}$$
(1)

where \(\alpha_i \in (0,1]\) is the learning rate of agent i, \(\lambda \in [0,1)\) is a discount factor, \(r(s,a)\) and \(Q(s,a)\) are the immediate and expected reward of choosing action a in state s at time step t, respectively, and \(Q(s',a')\) is the expected discounted reward of choosing action \(a'\) in state \(s'\) at time step \(t+1\).

Q-values of each state-action pair are stored in a table for a discrete state-action space. At each time step, agent i chooses the best-response action with the highest Q-value in the corresponding Q-value table with probability \(1-\epsilon\) (i.e., exploitation), or chooses another action at random with exploration probability \(\epsilon\) (i.e., exploration).
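As an illustration of this decision rule, the following sketch implements a stateless Q-learner with \(\epsilon\)-greedy selection, matching Eq. 1 restricted to a single state; the class and attribute names are our own, not from the paper.

```python
import random

class QLearner:
    """Stateless Q-learner with epsilon-greedy exploration (illustrative sketch)."""

    def __init__(self, n_actions, alpha=0.1, epsilon=0.01, gamma=0.0):
        self.q = [0.0] * n_actions      # one Q-value per action (single state)
        self.alpha = alpha              # learning rate alpha_i
        self.epsilon = epsilon          # exploration probability epsilon_i
        self.gamma = gamma              # discount factor lambda (0 for a stateless game)

    def choose_action(self):
        if random.random() < self.epsilon:                       # exploration
            return random.randrange(len(self.q))
        best = max(self.q)                                        # exploitation, ties broken randomly
        return random.choice([a for a, v in enumerate(self.q) if v == best])

    def update(self, action, reward):
        # One-step Q-update (Eq. 1); with a single state, max_a' Q(s', a') = max(self.q)
        target = reward + self.gamma * max(self.q)
        self.q[action] += self.alpha * (target - self.q[action])
```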

3 The Proposed Learning Framework

3.1 Overview of the Hierarchical Learning Framework

Consider the structure of a society in a country: a government manages its citizens, a higher-level government manages lower-level governments, and so on. In this scenario, a government plays the role of a supervisor: it monitors the state of its subordinate citizens and guides them to act properly. Inspired by this structure, we propose the hierarchical network shown in Fig. 1. The agent network is partitioned into a series of clusters \(C_x, x \in \{1,2,\cdots,X\}\), where X is the number of clusters and each cluster \(C_x\) is assigned a supervisor x. A supervisor can be one of the subordinate agents or a dedicated agent. Supervisors are also connected to each other according to the structure of the lower-level network. They collect the learning information of their subordinate agents and can also interact with their neighboring supervisors. Similarly, a higher level of supervisors can be established on top of the network formed by the first level of supervisors. In this way, several levels of supervisors can be built until the population of supervisors at the top level is small enough to make a final decision efficiently. In this paper, we use a two-level hierarchy for simplicity.

Fig. 1. An example of a hierarchical network with three levels

Algorithm 1. The interaction protocol of the hierarchical learning framework

The interaction protocol of our hierarchical learning framework is given in Algorithm 1. At each time step, each agent i chooses the action \(a_i\) with the highest Q-value, or a random action with exploration probability \(\epsilon\) (Line 3). Agent i then interacts with a neighbor j selected at random from its neighbor set N(i) and receives a payoff \(r_i\) (Line 4). The action \(a_i\) and payoff \(r_i\) are reported to agent i's supervisor x (Line 5), and supervisor x combines all the actions of its subordinate agents into an assembled action \(a_x\) (Line 8). Supervisor x then interacts with one of its neighboring supervisors and updates \(a_x\) based on the performance difference between its own action and its neighbor's (Line 9); this process is described in detail below. Once supervisor x has made its final decision \(a_x\), it issues \(a_x\) to its subordinate agents (Line 10). Based on this information, agent i adjusts its learning rate \(\alpha_i\) or exploration probability \(\epsilon_i\) (Line 13). Finally, each agent updates its Q-values using the new learning rate \(\alpha_i\) (Line 14). This process is iterated for T time steps.
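As a rough guide to how the steps of Algorithm 1 fit together, the sketch below compresses one time step of the protocol into a single function. It reuses the Q-learner sketch above and assumes hypothetical supervisor methods aggregate and imitate_neighbor, as well as an agent method adapt_parameters, standing in for the mechanisms detailed in Sect. 3.2.

```python
import random

def protocol_step(agents, neighbors, supervisors, cluster_of):
    """One time step of the hierarchical interaction protocol (illustrative sketch)."""
    reports = {x: [] for x in supervisors}                 # per-cluster (action, payoff) reports

    # 1. Distributed interactions (Lines 3-5 of Algorithm 1)
    actions = {i: ag.choose_action() for i, ag in agents.items()}
    payoffs = {}
    for i, ag in agents.items():
        j = random.choice(neighbors[i])                    # random neighbor of agent i
        payoffs[i] = coordination_payoff(actions[i], actions[j])
        reports[cluster_of[i]].append((actions[i], payoffs[i]))

    # 2. Supervision (Lines 8-10): aggregate, learn from a neighboring supervisor, issue guide action
    guide = {}
    for x, sup in supervisors.items():
        a_x = sup.aggregate(reports[x])                    # democratic voting (Eq. 5)
        a_x = sup.imitate_neighbor(a_x)                    # proportional imitation (Eq. 6)
        guide[x] = a_x

    # 3. Strategy adaptation and Q-update (Lines 13-14)
    for i, ag in agents.items():
        ag.adapt_parameters(actions[i], guide[cluster_of[i]])   # WoLF-style adjustment of alpha/epsilon
        ag.update(actions[i], payoffs[i])
```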

The main feature of the hierarchical learning framework is the hierarchical supervision governing the distributed interactions among agents. As supervisors have a wider view than subordinate agents, they can direct their subordinate agents by learning from other supervisors. Through the feedback from supervisors, subordinate agents can determine whether their actions comply with the norms in the society, and thereby adjust their learning behaviors accordingly. By integrating supervisors' central control into agents' local learning processes, norms can be established more effectively and efficiently in the hierarchical learning framework than in the traditional social learning framework based on purely distributed interactions.

3.2 The Hierarchical Learning Mechanism

The Report Mechanism and Assembly Method. Each agent reports its action \(a_i\) and payoff \(r_i\) to its supervisor x. The supervisor aggregates this information into two tables, \(QS_x\) and \(RS_x\), where \(QS_x(a)\) is the overall acceptance of action a in the cluster and \(RS_x(a)\) is the overall (average) reward of action a in the cluster. \(QS_x(a)\) is calculated by Eq. 2,

$$\begin{aligned} QS_x(a) = \sum \limits _{i \in C_x} s_i\end{aligned}$$
(2)

where \(s_i\) is an indicator function given as follows:

$$\begin{aligned} s_i = \begin{cases} 1 & \text{if } a_i = a,\\ 0 & \text{otherwise.} \end{cases} \end{aligned}$$
(3)

\(RS_x(a)\) can be calculated by Eq. 4,

$$\begin{aligned} RS_x(a) = \frac{1}{QS_x(a)}\sum \limits _{i \in C_x, a_i = a} r_i \end{aligned}$$
(4)

In particular, if \(QS_x(a) = 0\), \(RS_x(a)\) is set to 0.

Each supervisor then determines its aggregated action \(a_x\). We use a democratic voting mechanism in this paper: \(a_x\) is simply the action most accepted in the cluster (ties are broken randomly). The aggregated action \(a_x\) is given by Eq. 5,

$$\begin{aligned} a_x = \mathop{\arg\max}\limits_{a} QS_x(a) \end{aligned}$$
(5)
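A possible implementation of this aggregation step (Eqs. 2–5) is sketched below; the helper name and the report format are our own choices.

```python
import random
from collections import defaultdict

def aggregate(reports, n_actions):
    """Compute QS_x (acceptance counts, Eqs. 2-3), RS_x (average rewards, Eq. 4)
    and the aggregated action a_x (Eq. 5) from a list of (action, payoff) reports."""
    QS = defaultdict(int)
    reward_sum = defaultdict(float)
    for a_i, r_i in reports:
        QS[a_i] += 1
        reward_sum[a_i] += r_i
    RS = {a: (reward_sum[a] / QS[a] if QS[a] > 0 else 0.0) for a in range(n_actions)}
    best = max(QS.values()) if QS else 0
    a_x = random.choice([a for a in range(n_actions) if QS[a] == best])  # ties broken randomly
    return QS, RS, a_x
```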

The Interaction Mechanism of Supervisors. After combining the actions of its cluster, each supervisor obtains an assembled action \(a_x\), which indicates the norm (i.e., action) accepted by the cluster. In order to generate a better supervision policy, each supervisor then interacts with a randomly chosen neighboring supervisor and learns from it by comparing the performance of their assembled actions. The motivation for this comparison comes from evolutionary game theory [13], which provides a powerful methodology for modeling how strategies evolve over time based on their performance. In this theory, strategies compete with each other and evolve while agents learn (mainly through imitation) from others with higher fitness. One of the most widely used imitation rules is proportional imitation, given by Eq. 6 [14],

$$\begin{aligned} p = \frac{1}{{1 + {e^{ - \beta ({u_y} - {u_x})}}}} \end{aligned}$$
(6)

where p is the probability that supervisor x imitates its neighbor y, \(u_x = RS_x(a_x)\) is the fitness of supervisor x (the average payoff of its assembled action in this paper), \(u_y = RS_y(a_y)\) is the fitness of y, and \(\beta\) is a parameter controlling the intensity of selection.

In Eq. 6, the probability p is large when \(u_x\) is smaller than \(u_y\). Each supervisor x then adopts \(a_y\) as its final assembled decision with probability p, or retains \(a_x\) with probability \(1-p\). In this way, supervisors imitate the better actions taken by their neighbors in order to improve the behavior of their clusters.
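A sketch of this imitation step, assuming each supervisor can read the neighbor's RS table and assembled action:

```python
import math
import random

def imitate(a_x, RS_x, a_y, RS_y, beta=0.1):
    """Proportional imitation (Eq. 6): adopt the neighbor's assembled action a_y
    with probability p = 1 / (1 + exp(-beta * (u_y - u_x))), otherwise keep a_x."""
    u_x, u_y = RS_x[a_x], RS_y[a_y]          # fitness = average payoff of the assembled action
    p = 1.0 / (1.0 + math.exp(-beta * (u_y - u_x)))
    return a_y if random.random() < p else a_x
```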

Strategy Adaptation of Subordinate Agents. How to adjust the learning strategies of subordinate agents using the guide information from supervisors is the key problem in the hierarchical learning framework. Based on this guide information, subordinate agents can determine whether they comply with the norm in the society and adjust their learning behaviors accordingly. To explicitly evaluate the agents' performance in terms of norm compliance, we borrow the WoLF (Win-or-Learn-Fast) principle [15], a well-known mechanism in multiagent learning, into the hierarchical learning framework. Each subordinate agent i compares \(a_x\) with its own action \(a_i\). If \(a_i = a_x\), we say that agent i is “winning”, and “losing” otherwise. Based on the “winning” or “losing” result, the two fundamental learning parameters (i.e., the learning rate \(\alpha\) and the exploration probability \(\epsilon\)) are adapted as follows.

Algorithm 2 shows the mechanism for adapting \(\alpha\), denoted hierarchical learning-\(\alpha\). The notation s is the direction of change of the learning rate. If \(a_i = a_x\), the action chosen by agent i is approved by its supervisor, and agent i decreases its learning rate \(\alpha_i\) to maintain the “winning” state. If \(a_i \ne a_x\), agent i increases its learning rate \(\alpha_i\) to learn more from experience; being more adaptive may bring about a better choice in the next round of play.

Similarly, Algorithm 3 shows the mechanism for adapting \(\epsilon\), denoted hierarchical learning-\(\epsilon\). When \(a_i = a_x\), agent i decreases its exploration probability \(\epsilon_i\) to maintain the “winning” state and to reduce the probability of mistakes caused by exploration. When \(a_i \ne a_x\), agent i increases its exploration probability \(\epsilon_i\) to explore more actions and escape the “losing” state.

The third approach is to combine the two above (i.e., adapting \(\alpha\) and \(\epsilon\) simultaneously), and is denoted hierarchical learning-both.

Algorithm 2. Hierarchical learning-\(\alpha\): adapting the learning rate
Algorithm 3. Hierarchical learning-\(\epsilon\): adapting the exploration probability
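A sketch of the WoLF-style adaptation behind Algorithms 2 and 3 is shown below. The paper specifies only the direction of change, so the multiplicative step of size \(\gamma\) and the clipping bounds are our assumptions; hierarchical learning-\(\alpha\) would apply only the first update, hierarchical learning-\(\epsilon\) only the second, and hierarchical learning-both applies both.

```python
def adapt_parameters(agent, a_i, a_x, gamma=0.1):
    """WoLF-style parameter adaptation (in the spirit of Algorithms 2 and 3).
    The multiplicative step form and the clipping are our assumptions; the paper
    only specifies the direction of change in each case."""
    winning = (a_i == a_x)
    s = -1 if winning else +1                                 # "winning": decrease, "losing": increase
    # hierarchical learning-alpha: adapt the learning rate
    agent.alpha = min(1.0, max(0.0, agent.alpha * (1 + s * gamma)))
    # hierarchical learning-epsilon: adapt the exploration probability
    agent.epsilon = min(1.0, max(0.0, agent.epsilon * (1 + s * gamma)))
```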

4 Experiment and Results

In this section, experiments are carried out to demonstrate the performance of our framework. We compare the hierarchical learning approaches with the conventional social learning approach in [10] and explore the influence of several parameters on norm emergence under the proposed framework.

4.1 Experimental Setting

In this paper, we test the hierarchical learning framework on a \(10 \times 10\) grid network. The framework consists of two levels: subordinate agents and supervisors. The network is partitioned into \(4 \times 4\) clusters unless otherwise specified. In the Q-learning algorithm, all states are treated as identical, i.e., the game is stateless and there are no state transitions. Each agent chooses from 4 actions by default. Parameters \(\alpha\) and \(\epsilon\) of each agent are initially set to 0.1 and 0.01, respectively. Moreover, the parameter \(\beta\) controlling the intensity of selection in Eq. 6 and the parameter \(\gamma\) in Algorithms 2 and 3 are both set to 0.1. Unless specified otherwise, the reported results are averaged over 100 Monte-Carlo runs.
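For reference, the default setting described above can be collected into a small configuration sketch; the dictionary itself is merely our bookkeeping device, with all values taken from this section.

```python
DEFAULT_CONFIG = {
    "grid_size": (10, 10),        # 10x10 grid network of subordinate agents
    "cluster_size": (4, 4),       # each cluster is supervised by one second-level supervisor
    "n_actions": 4,               # default size of the action space
    "alpha_init": 0.1,            # initial learning rate of each agent
    "epsilon_init": 0.01,         # initial exploration probability
    "beta": 0.1,                  # intensity of selection in Eq. 6
    "gamma_adapt": 0.1,           # step parameter in Algorithms 2 and 3
    "monte_carlo_runs": 100,      # results averaged over 100 runs unless stated otherwise
}
```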

4.2 Results and Analysis

Convergence of Social Norms. Previous work has shown that norms can always emerge when agents interact with randomly chosen neighbors. We compared the approach based on the traditional social learning framework [10] with the three approaches based on our hierarchical learning framework. Figure 2 gives the learning dynamics of the four approaches. In all four cases, the average reward of the system increases and then stabilizes, which means that a social norm emerges with each approach. The convergence level and efficiency, however, differ among the approaches. The hierarchical learning approaches (especially hierarchical learning-\(\epsilon\) and hierarchical learning-both) are slower at first but eventually reach a higher convergence level than the social learning approach. Moreover, hierarchical learning-\(\alpha\) establishes norms much faster than the other two hierarchical learning approaches. These results indicate that, with the guide information of supervisors, agents know more about the society, and the hierarchical mechanism is an effective way to remove subnorms in MASs and reach a higher level of norm emergence. Hierarchical learning-\(\alpha\) is faster than the other two approaches because, with hierarchical learning-\(\epsilon\) or hierarchical learning-both, agents must spend more time exploring actions at first. Eventually, however, when \(\epsilon\) drops to nearly 0, those agents choose the correct action more firmly than under the other approaches.

Fig. 2. The comparison of different learning approaches.

Table 2 gives the detailed performance of the different approaches over 1000 runs. In order to better demonstrate their differences, we also include results for the case in which 100 % of agents have chosen the same action as the norm; achieving 100 % norm emergence is extremely challenging because of the widely recognized existence of subnorms. To calculate the average number of steps needed for norm emergence, a run in which a norm fails to emerge within 10000 episodes is counted as 10000 episodes. We compare the average reward, the success rate (i.e., the fraction of the 1000 runs in which a norm emerges successfully), and the speed (i.e., the number of time steps needed for a norm to emerge). The hierarchical learning approaches are clearly more efficient and effective than traditional social learning on all three criteria. Although hierarchical learning-\(\alpha\) establishes a norm more quickly than the other two approaches (refer to Fig. 2), its average number of steps to norm emergence is larger because of a small probability of failing to emerge a norm (i.e., 10000 steps are counted whenever a norm fails to emerge).

Table 2. Performance of different learning approaches (averaged over 1000 runs)
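The three criteria reported in Table 2 can be computed as in the following sketch, assuming each run records whether a norm emerged, after how many steps, and the final average reward; the 10000-step cap for failed runs follows the convention described above.

```python
def summarize_runs(runs, cap=10_000):
    """Compute average reward, success rate and average steps to norm emergence
    over a list of runs; failed runs count as `cap` steps (illustrative sketch).

    Each run is a dict like {"emerged": bool, "steps": int, "avg_reward": float}.
    """
    n = len(runs)
    success_rate = sum(r["emerged"] for r in runs) / n
    avg_steps = sum(r["steps"] if r["emerged"] else cap for r in runs) / n
    avg_reward = sum(r["avg_reward"] for r in runs) / n
    return {"avg_reward": avg_reward, "success_rate": success_rate, "avg_steps": avg_steps}
```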

Dynamics of Learning Rate \(\alpha\) and Exploration Probability \(\epsilon\). We studied the dynamics of the exploration probability \(\epsilon\) and the learning rate \(\alpha\) for different action-space sizes in Figs. 3 and 4 (the curves of \(\alpha\) and \(\epsilon\) under hierarchical learning-both overlap because the two parameters share an identical update). In both cases, the exploration probability \(\epsilon\) and the learning rate \(\alpha\) rise at first and then decline to 0. When there are only 2 actions, both \(\alpha\) and \(\epsilon\) rise slightly and then quickly decline to nearly 0. When the number of actions increases to 4, it still takes only a short time for \(\alpha\) to reach 0 under hierarchical learning-\(\alpha\), but it takes longer for the values to decline to 0 under hierarchical learning-\(\epsilon\) and hierarchical learning-both.

Fig. 3. Dynamics of \(\epsilon\) and \(\alpha\) (2 actions).

Fig. 4. Dynamics of \(\epsilon\) and \(\alpha\) (4 actions).

The dynamics of \(\alpha\) and \(\epsilon\) indicate that agents initially do not know what they ought to do and therefore choose actions randomly. As the norm emerges, agents recognize which action should become the norm and thus reduce the exploration probability \(\epsilon\) and the learning rate \(\alpha\) accordingly. Hierarchical learning-\(\alpha\) is more adaptive in larger action spaces. In a two-action scenario, agents can easily find the correct action when \(\epsilon\) is increased, because they are then more likely to try the other action, which has a lower Q-value but is correct. In a multi-action scenario, however, the probability of finding the correct action decreases as the exploration probability \(\epsilon\) increases, and failing to find the correct action leads to a further increase of \(\epsilon\); as a result, agents may be trapped in a vicious circle. If we instead adapt the learning rate \(\alpha\), agents learn more from unsuccessful interactions, which speeds up the learning process.

Influence of the Number of Actions. The emergence of norms in a large action space is a difficult problem in the field of norm emergence. Figures 5 and 6 compare the hierarchical learning approach (i.e., hierarchical learning-\(\alpha\)) with the conventional social learning approach. The hierarchical learning approach performs better than the social learning approach in terms of both convergence speed and convergence level, and the distinction becomes more apparent as the action space grows. This is because agents benefit from the wide view of their supervisors and gain access to the state of the whole society. With this information they can determine whether they comply with the norm and decide whether to learn/explore more or keep their current behavior. It is the process of imitating others with higher rewards that reduces the time spent exploring unnecessary actions. As a result, the norm can emerge efficiently and subnorms can be removed to a large extent.

Fig. 5. Influence of number of actions with hierarchical learning-\(\alpha\).

Fig. 6. Influence of number of actions with traditional social learning.

Influence of Population Size and Cluster Size. The influence of population size on norm emergence is shown in Fig. 7. Efficiency is nearly unaffected by population size, but effectiveness (i.e., the level of emergence) is clearly reduced in larger populations. This is because, with the cluster size fixed, each supervisor covers a comparatively wider portion of the society when the population is smaller. Each supervisor then exerts a comparatively stronger control over its cluster, the system is closer to centralized control, and the level of emergence is higher. Similarly, Fig. 8 shows how the cluster size influences the level of emergence when the population size is fixed. A larger cluster size results in a higher level of emergence: when the cluster is larger, each supervisor exerts a comparatively stronger control over its cluster, and the norm therefore emerges at a higher level.

Fig. 7. Influence of population size when separated into \(4 \times 4\) clusters.

Fig. 8. Influence of cluster size in a \(30 \times 30\) grid network.

5 Related Work

Many studies have investigated mechanisms of norm emergence. Shoham and Tennenholtz [8] proposed a rule based on the highest cumulative reward to study norm emergence. Sen et al. [10] presented an emergence model in which individuals learn from randomly selected agents in the group. Sen et al. [2] then extended this model by assessing the influence of heterogeneous agent systems and space-constrained interactions on norm emergence. Savarimuthu [1, 16] reviewed research on multiagent-based emergence mechanisms and presented the role of three proactive learning methods in accelerating norm emergence. Airiau and Sen [11, 17] extended the group-based random learning model [10] to complex networks to study the influence of topology on norm emergence. Villatoro et al. [18] proposed a reward learning mechanism based on interaction history, and later introduced two rules (i.e., re-building links with the neighborhood and observing neighbors' behaviors) to overcome the subnorm problem [7, 19]. Mihaylov [20] proposed a Win-Stay Lose-Probabilistic-Shift learning mechanism to achieve 100 % norm emergence in complex networks. Mahmoud et al. [21] further extended Axelrod's seminal model by considering topological structures among agents. Unlike these studies, which focus on norm emergence either in a fully mixed agent population or in a networked agent society, our work captures the organizational characteristic of hierarchical supervision among learning agents in networked MASs. This feature sets our work apart from all these existing studies.

From another perspective, the problem addressed in this paper is to employ heuristics to guide the policy search and thereby speed up multi-agent learning. There is a great deal of work in multi-agent learning research addressing this problem. For example, Zhang et al. [22] defined a multi-level organizational structure for automated supervision and a communication protocol for exchanging information between lower-level agents and higher-level supervising agents, and Bianchi et al. [23] proposed Heuristically Accelerated Minimax-Q (HAMMQ), which incorporates heuristics into the Minimax-Q algorithm to speed up convergence. Our work differs from these studies in that it targets a different problem (i.e., norm emergence in networked MASs). Nevertheless, the main principle embodied in the proposed framework can shed some light on the learning dynamics of multi-agent learning during efficient convergence to a predefined (i.e., Nash) equilibrium.

6 Conclusions and Future Work

Norm emergence is a crucial issue in understanding and controlling complex systems. Assuming a fully centralized controller to govern norm emergence is not only infeasible for large systems but also expensive in terms of management cost. It is therefore more practical for a norm to emerge on its own through agents' local interactions. This purely distributed mode of norm emergence, however, suffers from low efficiency, especially as the system becomes complex. The hierarchical learning framework proposed in this paper strikes a balance between centralized control and distributed interactions during the emergence of social norms by integrating hierarchical supervision into the distributed learning of agents. Experiments indicate that this compromise is robust and efficient for evolving stable norms in networked systems, especially when the norm space is large.

Much work remains to be done. For example, the hierarchical learning framework can be evaluated on complex networks, such as small-world and scale-free networks, and the influence of network structure on norm emergence can be explored in depth. Moreover, this work uses a two-level network in the experiments; multilevel networks may lead to an even more efficient emergence of social norms.