
1 Introduction

The pervasiveness of computing and networking fosters applications backed by large-scale cyber-physical collectives—cf. edge-fog-cloud infrastructures, robot swarms, and smart ecosystems. Combined with the autonomic computing vision [18], which promotes autonomy and self-* capabilities in engineered systems, there is an increasing trend towards Collective Adaptive Systems (CASs) and their engineering [9, 31]. CASs are characterised by a multitude of agents that can produce globally coherent results (emergents [43]) and exhibit collective-level adaptivity to environment change via local decision-making and decentralised interaction. The engineering of CASs is a significant open research problem [19, 31], tightly linked with the problems of “steering” self-organisation and “controlling” emergence so as to promote desired emergents while avoiding undesired ones [29]. In general, when dealing with CASs, there are two distinct problems: (i) given an initial system state and local behavioural rules, predicting what global outcomes will be produced (the forward, prediction, or local-to-global problem); and (ii) determining what local behavioural rules must be assigned to the system devices to achieve certain global outcomes (the inverse, control, or global-to-local problem). These two problems provide corresponding perspectives for designing CASs. In particular, the latter perspective has promoted research on spatial and macro-programming [7, 10], aiming at expressing programs in terms of the desired global outcome and leaving the underlying platform to deal with the global-to-local mapping.

In this work, we consider Aggregate Computing (AC) [8], a prominent field-based coordination approach [41] promoting macro-programming by capturing CAS behaviours as functions operating on computational fields [41], in a system model of neighbour-interacting devices operating in asynchronous sense-compute-interact rounds. A computational field is a macro-abstraction that maps a set of devices, over time, to computational values. AC is based on the Field Calculus (FC) [41], or variants thereof, which provide constructs for manipulating and evolving fields. So, CAS behaviour can be expressed by a single aggregate program (global perspective) that also defines what processing and communication activities must be performed by each individual device (local perspective).

Besides the programming model and its implications, a significant portion of research on AC [41] has focussed on the design and analysis of coordination algorithms expressed in FC for efficiently carrying out self-organising behaviours like, e.g., computing fields of minimum distances from sources (gradients) [4, 24, 30], electing leaders [27], or distributed summarisation [3]. However, devising self-organising coordination algorithms is not easy; it is especially difficult to identify solutions that are efficient across environment assumptions, configurations, and perturbations. The difficulty lies in determining, for the current context, the local decisions of each device (in terms, e.g., of processing steps and communication acts) that produce output fields quickly converging to the correct solution.

In this work, we start what we consider a key future research thread of field-based coordination, i.e., the study of Machine Learning techniques to improve existing AC coordination algorithms. Specifically, we adopt a Reinforcement Learning (RL)–based approach—where an agent learns from experience how to behave in order to maximise delayed cumulative rewards [38]. We devise a general methodology that somewhat resembles the notion of sketching in program synthesis [36]: a template program is given and holes are filled with actions determined through search. In our case, the program is the AC specification of a coordination algorithm, and holes are filled with actions of a policy learnt through Hysteretic Q-Learning [25]. We consider the case of the classic gradient algorithm, a paradigmatic and key building block of self-organising coordination [7, 10, 41]: we show via simulations that the system, after sufficient training, learns an improved way to compute and adjust gradient fields to network perturbations.

In the rest of the paper: Sect. 2 offers background on AC and RL; Sect. 3 discusses the integration of RL and AC, and the use of the approach to improve the basic gradient algorithm; Sect. 4 provides an experimental evaluation of the proposed approach; Sect. 5 summarises results and future research.

2 Background

In this work, we contribute to field-based coordination [23, 24, 40, 41, 45], a well-known nature-inspired approach [32] to coordination that exploits computational mechanisms inspired by the force fields of physics. Use of artificial force fields for navigation and obstacle avoidance has been explored since the 90s [45]. In co-fields [24], fields produced by agents or the environment are used to drive the activities of agent swarms. In TOTA (Tuples on the air) [23], tuples spread over the network to model dynamic distributed data. In Aggregate Computing [8, 41], the field-based coordination approach we consider in this work, fields are used as an abstraction for functionally expressing CAS behaviour.

We recap Aggregate Computing in Sect. 2.1 and its gradient building block in Sect. 2.2. Then, we review RL in Sect. 2.3, to prepare the ground for our contribution, Reinforcement Learning-based Aggregate Computing (Sect. 3).

2.1 Aggregate Computing

Aggregate Computing [8, 41] is a paradigm for CAS programming. The approach generally assumes a system model of neighbour-interacting devices that operate in asynchronous rounds of sense-compute-act steps. On top of such an execution model, self-organising collective behaviour is expressed in terms of functional manipulations of computational fields [41]: maps from devices to values. A field can denote, e.g., what different devices sense from the environment, or the outputs of their computations. The Field Calculus (FC) [41] is a core functional language that captures the key constructs needed to properly manipulate fields in order to express collective adaptive computations; they cover state evolution, communication, and computation branching. The main benefit of AC/FC is its compositionality: the ability to abstract collective adaptive behaviours into reusable functions that can be composed together to build more complex behaviours.

The FC is implemented by aggregate programming languages such as ScaFi (Scala Fields) [14]. So, in practice, developing a CAS using this paradigm amounts to: (i) writing an aggregate program using e.g. ScaFi; (ii) setting up an AC middleware (for simulation or concrete distributed systems) to handle the scheduling of computations and communications; (iii) deploying and configuring the middleware and the program on a network of nodes. The approach has proven effective to implement various kinds of coordination services [15] and self-* applications in domains like crowd management [8], swarm robotics [11], and smart cities [12]—see [41] for a recent review.

In the following, we summarise the computation model and the FC/ScaFi language, which are essential to understand the contribution and case study.

System Model. For an aggregate program to yield collective adaptive behaviour, ongoing computation and communication are needed. An individual atomic step of a device is called a round and consists of the following:

  1. context acquisition: the device collects information from the sensors, and the most recent messages received from each neighbour (including the device itself—to model state);

  2. program evaluation: the aggregate program is evaluated against the acquired context, yielding an export, namely a message to be sent to neighbours for coordination purposes, implied by the use of communication constructs in the aggregate program;

  3. export sharing: the export is sent to the neighbours;

  4. actuations: the export also includes information that can be used to drive actuators in the local node.
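To make the execution model concrete, the following is a minimal sketch of one such round (illustrative types and names; not the actual ScaFi middleware API):

```scala
// Minimal sketch of a device round (illustrative types; not the actual ScaFi middleware API).
final case class Export(payload: Map[String, Any])
final case class Context(sensors: Map[String, Any], neighbourExports: Map[Int, Export])

trait Device {
  def readSensors(): Map[String, Any]    // local sensor values
  def recentExports(): Map[Int, Export]  // most recent message per neighbour (including self)
  def evalProgram(ctx: Context): Export  // evaluation of the aggregate program
  def broadcast(export: Export): Unit    // send the export to all neighbours
  def actuate(export: Export): Unit      // drive local actuators from the export
}

def round(d: Device): Unit = {
  val ctx    = Context(d.readSensors(), d.recentExports())  // 1. context acquisition
  val export = d.evalProgram(ctx)                           // 2. program evaluation
  d.broadcast(export)                                       // 3. export sharing
  d.actuate(export)                                         // 4. actuations
}
```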

Round execution is completely asynchronous: there is no global clock or barrier to coordinate the aggregate. Scheduling of rounds might be periodic or reactive [34], and messages from neighbours are assumed to be retained for some configurable amount of time. Such asynchrony, combined with local interaction, promotes scalability. The combination of the aggregate program logic and such a collective, periodic execution promotes the emergence of globally coherent results.

Field Calculus. The main constructs that capture the essential aspects for programming self-organising systems with FC are:

  • Stateful field evolution | rep(e1){(x) => e2}. This expression describes a field evolving in time: \(e_1\) is the initial field value and the function \((x) => e_2\) defines how the field changes round by round, substituting x with the previously computed field value (at the first round, \(x = e_1\)).

  • Neighbour interaction | nbr(e). This expression involves evaluation of e, sharing of the corresponding local value with neighbours, and observation of the neighbours’ evaluations of e. Then, hood operators can be used to locally reduce such neighbouring fields to values. For instance, minHood(nbr(e)) returns, for a device, the minimum value of e found in its neighbourhood.

  • Domain partitioning | branch(e0){e1}{e2}. This expression splits the computational field into two non-communicating domains hosting isolated sub-computations: \(e_1\) where \(e_0\) is true, and \(e_2\) where \(e_0\) is false.
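As a minimal illustration of how these constructs compose, the following sketch (in ScaFi-like syntax; `warmRegion` and `temperature` are hypothetical inputs) gossips the maximum temperature observed across the region where `warmRegion` holds, while devices outside it compute in isolation:

```scala
// Sketch in ScaFi-like syntax; `warmRegion` and `temperature` are hypothetical inputs.
// Each device evolves a local estimate (rep), shares it with neighbours (nbr), and the
// computation is split into two isolated domains (branch).
def maxObservedTemperature(warmRegion: Boolean, temperature: Double): Double =
  branch(warmRegion) {
    rep(temperature) { estimate =>
      foldhood(temperature)(math.max)(nbr(estimate))  // reduce the neighbouring field locally
    }
  } {
    temperature  // outside the region: purely local value, isolated from the other domain
  }
```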

Full coverage of FC/ScaFi is beyond the scope of this paper; a more comprehensive presentation is given in [41]. As a key example, we introduce the gradient algorithm, which will be considered in Sects. 3 and 4.

2.2 The Gradient Building Block

A gradient [4, 24, 30] is a field mapping each device in the system to its minimum distance from the closest source device. A (self-healing) gradient algorithm is one that computes a gradient field and automatically adjusts it after changes in the source set or the connectivity network. This algorithm is important as it often recurs as part of higher-level self-organising algorithms, such as information flows [44], distributed data collection [3], and regional network partitioning [13]. A simple implementation, which we call the classic gradient, can be expressed in FC as follows:

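A minimal rendering of this program in ScaFi-like syntax (a sketch; the operator names follow the standard ScaFi API and match the description below):

```scala
// Classic (self-healing) gradient, sketched in ScaFi-like syntax.
def classicGradient(source: Boolean): Double =
  rep(Double.PositiveInfinity) { distance =>
    mux(source) {
      0.0                                      // sources are at distance 0
    } {
      minHoodPlus(nbr(distance) + nbrRange())  // minimum over neighbours, excluding self
    }
  }
```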

where nbrRange() is a 0-ary function that evaluates the distance between two neighbours; mux is a conditional expression selector that evaluates all its arguments and selects the second if the first is true, or the third otherwise; and minHoodPlus selects the minimum of its argument across the neighbourhood without considering the contribution of the device itself. A repeated evaluation of this program will self-stabilise to the field of minimum distances from the sources, i.e., it will eventually converge to the ideal gradient field once the inputs and topology stop changing.

Though working and self-stabilising, this algorithm suffers from some problems [4]. One is the rising value problem (also known as count to infinity): due to the repeated minimisation of all the contributions, the system handles well the situations where the output needs to drop (e.g., a new source enters the system), but it reacts slowly when the output needs to rise (e.g., a source node turns off). In the literature, several heuristics have been proposed to tackle the problems of the classic gradient [4]. One of them is CRF (Constraint and Restoring Force) [6]. Its goal is to deal with the problem by enforcing a constant rising speed when nodes recognise a locally slow rise of the gradient field. To this end, each node is affected by a set of constraints (i.e. neighbours that hold a lower gradient value). If a node finds that it is rising too slowly (i.e. there are no more constraints), it increases its output at a fixed velocity, ignoring its neighbours; otherwise, the gradient output follows the classic formula.
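To convey the intuition in code, the following is a deliberately simplified sketch of the rule just described; it is not the published CRF algorithm [6], which is more refined (e.g., it also tracks rising velocities):

```scala
// Simplified CRF-like rule (illustrative only, NOT the published algorithm [6]).
def crfLikeGradient(source: Boolean, risingSpeed: Double): Double =
  rep(Double.PositiveInfinity) { g =>
    mux(source) {
      0.0
    } {
      // constraints: neighbours currently holding a lower gradient value
      val constrained = foldhoodPlus(false)(_ || _)(nbr(g) < g)
      if (constrained) minHoodPlus(nbr(g) + nbrRange())  // classic formula
      else g + risingSpeed                               // no constraints: rise at fixed speed
    }
  }
```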

2.3 Reinforcement Learning

Reinforcement Learning [38] is a generic framework used to structure control problems. The focus is on sequential interactions between agents (i.e. entities able to act) and an environment (i.e. anything outside the control of the agents). At each discrete time step t, an agent observes the current environment state \(s_t\) (i.e. the information perceivable by the agent) and selects an action \(a_t\) through a policy \(\pi \) (i.e. a probabilistic mapping from states to actions). Partly as a consequence of its action, at the next time step \(t+1\) the agent finds itself in state \(s_{t+1}\) and receives a reward \(r_{t+1}\), i.e. a scalar signal quantifying how good the action was against the given environment configuration. The goal of RL is to learn a policy \(\pi ^*\) that maximises the long-term return G (i.e. the cumulative reward) through a trial-and-error process. Many problems can be framed in these settings, like video games [2], robotics [20], routing [22], etc.

This general framework is formalised by the Markov Decision Process (MDP), a mathematical model that describes the environment evolution in sequential decision problems. An MDP consists of a tuple \(\langle \mathcal {S}, \mathcal {A}, \mathcal {P}, \mathcal {R}\rangle \) in which:

  • \(\mathcal {S}\) denotes the set of states;

  • \(\mathcal {A}\) is the set of actions;

  • \(\mathcal {P}(s_{t + 1} | s_t, a_t)\) defines the probability of reaching state \(s_{t + 1}\) starting from \(s_t\) and performing \(a_t\) (the transition probability function);

  • \(\mathcal {R}(s_t, a_t, s_{t+1})\) is a (possibly stochastic) reward function.

An MDP is memory-less (Markov property): the next environment state depends only on the current state and action. Typically, in RL problems, agents do not have access to \(\mathcal {R}\) or \(\mathcal {P}\), but can rely only on the experience \((s_t, a_t, r_t)\) sampled at each time step. The return G is defined as the discounted sum of rewards along a possible future trajectory \(\tau \) (i.e. a sequence of time steps):

$$\begin{aligned} G_{t} = r_t + \gamma r_{t + 1} + \gamma ^2 r_{t + 2} + \dots + \gamma ^{T - t} r_{T} = \sum _{k = t}^T \gamma ^{k-t} r_k \end{aligned}$$
(1)

where \(0 \le \gamma \le 1\) is the discount factor, which determines how much future rewards impact the long-term return. Finally, the RL goal can be expressed as the maximisation of the expected long-term return following a policy \(\pi \):

$$\begin{aligned} J(\pi ) = \mathbb {E}_{\tau \sim \pi } \big [ G_0 \big ], \qquad \pi ^* = \arg \max _{\pi } J(\pi ) \end{aligned}$$
(2)

The classification of RL algorithms depends on how \(\pi ^*\) (i.e. the optimal policy) is derived according to J. In particular, value-based methods learn a further function (\(Q^\pi \) or \(V^\pi \)) from which \(\pi ^*\) is derived. \(V^\pi \) is the value function that evaluates how good (or bad) a state is according to the expected long-term return obtained by following the policy \(\pi \). It is defined as:

$$\begin{aligned} V^\pi (s) = \mathbb {E}_\pi \Big [ G_t | s_t = s \Big ] \end{aligned}$$
(3)

\(Q^\pi \) is the corresponding value function that evaluates state-action pairs:

$$\begin{aligned} Q^\pi (s, a) = \mathbb {E}_\pi \Big [ G_t | s_t = s, a_t = a \Big ] \end{aligned}$$
(4)

Policies can be defined through value functions. In particular, a greedy policy based on the Q function is one that always chooses the action with the highest value in a given state: \(\pi (s) = \arg \max _{a} Q(s, a)\).

Q-Learning [42] is one of the most famous value-based algorithms. It aims at finding \(Q^*\) (i.e. the Q function associated with \(\pi ^*\)) by incrementally refining a Q table while directly sampling from an unknown environment. In particular, this is done through a temporal-difference update performed at each time step:

$$\begin{aligned} Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, [r_t + \gamma \max _{a} Q(s_{t+1}, a) - Q(s_t, a_t)] \end{aligned}$$
(5)

where \(\alpha \) is the learning rate (i.e. how much new information influences the learned Q at each update). The agent typically follows an \(\epsilon \)-greedy policy (the behavioural policy), which chooses a random action with probability \(\epsilon \), to balance the exploration-exploitation trade-off. From \(Q^*\), the greedy policy \(\pi ^*\) (the target policy) can be extracted.
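For concreteness, a compact tabular Q-Learning sketch in plain Scala (library-free and illustrative; state and action types are left generic):

```scala
import scala.util.Random

// Tabular Q-Learning sketch with generic state/action types.
final class QLearner[S, A](actions: Seq[A], alpha: Double, gamma: Double, epsilon: Double) {
  private val q   = scala.collection.mutable.Map.empty[(S, A), Double].withDefaultValue(0.0)
  private val rnd = new Random()

  // Behavioural (epsilon-greedy) policy: random action with probability epsilon, greedy otherwise.
  def choose(s: S): A =
    if (rnd.nextDouble() < epsilon) actions(rnd.nextInt(actions.size))
    else actions.maxBy(a => q((s, a)))

  // Temporal-difference update (Eq. 5).
  def update(s: S, a: A, reward: Double, sNext: S): Unit = {
    val target = reward + gamma * actions.map(a2 => q((sNext, a2))).max
    q((s, a)) += alpha * (target - q((s, a)))
  }
}
```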

Nowadays, Q-Learning is applied in various fields, ranging from robotics to wireless sensor networks and smart grids [17]. However, one of the most challenging settings in which Q-Learning could be applied is when the learning process has to deal with multiple concurrent learners, namely a multi-agent system.

Multi-agent Reinforcement Learning. RL was originally proposed as a framework to control a single agent. However, in CASs we are interested in many agents that interact in a common environment. The study of learning in these settings is known as Multi-Agent Reinforcement Learning (MARL) [39].

A straightforward way to apply RL algorithms to multi-agent settings, called the independent learning (IL) approach [39], consists in deploying a learning process for each agent and considering the other agents as part of the environment. However, applying single-agent algorithms as-is would likely lead to poor results, due to the non-stationarity of the environment induced by concurrent learning, and to stochasticity [26]. Therefore, different algorithms have been proposed to handle these issues in IL settings, such as Hysteretic Q-Learning [25] and Distributed Q-Learning (DQL) [21]. Hysteretic Q-Learning (HQL) aims at managing the stochasticity problems by exploiting an optimistic heuristic. The idea is to give more importance to good actions than to bad ones – more frequent due to concurrent learning – reducing policy oscillation during learning. To this aim, HQL introduces two learning rates (\(\alpha \) and \(\beta \)) to weigh the increase and decrease of Q values. The update becomes:

$$\begin{aligned} \delta (t) = r_t + \gamma \max _{a} Q(s_{t + 1}, a) - Q(s_t, a_t) \end{aligned}$$
(6)
$$\begin{aligned} Q(s_t, a_t) = {\left\{ \begin{array}{ll} Q(s_t, a_t) + \alpha \, \delta (t) &{} \text {if } \delta (t) \ge 0 \\ Q(s_t, a_t) + \beta \, \delta (t) &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(7)

In this study (cf. Sect. 4), we use HQL as the reference RL algorithm since it has a strong empirical track record for cooperative multi-agent systems [5, 26, 46]. Moreover, although structurally similar to DQL, it can be used in non-deterministic MDPs—a typical setting for CASs. Indeed, DQL is a particular case of HQL where \(\beta = 0\); however, with \(\beta = 0\) the agents tend to overestimate the Q values, because the fully optimistic update ignores environment noise.
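With respect to plain Q-Learning, the only change is the asymmetric update; a sketch (again with illustrative names) follows:

```scala
// Hysteretic update (Eqs. 6-7): increments weighted by alpha, decrements by beta < alpha.
def hystereticUpdate[S, A](q: scala.collection.mutable.Map[(S, A), Double], actions: Seq[A],
                           alpha: Double, beta: Double, gamma: Double)
                          (s: S, a: A, reward: Double, sNext: S): Unit = {
  val qsa   = q.getOrElse((s, a), 0.0)
  val delta = reward + gamma * actions.map(a2 => q.getOrElse((sNext, a2), 0.0)).max - qsa
  val rate  = if (delta >= 0) alpha else beta   // optimistic: positive updates count more
  q((s, a)) = qsa + rate * delta
}
```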

RL can also be used in multi-agent settings through a central controller that learns by exploiting system-wide information. The learning problem thus becomes a single-agent one, in which the joint state and action spaces are the Cartesian products of those of the individual nodes. However, this solution cannot be applied to CASs, due to the many-agent setting, openness, and the lack of a central entity. Novel approaches are structured in between independent learning and centralised learning—the so-called Centralised Training Decentralised Execution (CTDE) [16]. In CTDE, a central agent performs the learning process and generates distributed controllers, one for each agent. CTDE is typically applied in offline learning through the use of simulations. This way, it is possible to leverage global information at simulation time, while the resulting controllers are completely independent of the central authority. So, when the learner finds a good policy, it can be removed from the system.

Although CTDE allows RL to be used in environments with many agents, the learner must consider a large population of agents at simulation time, leading to sample-inefficient algorithms. A solution to this problem in similar CASs (e.g. swarm robotics [37]) is to add a further constraint, namely homogeneous and cooperative agent behaviour [33]. With this assumption, the learner only has to find a single strategy that applies to the whole system.

3 Reinforcement Learning-based Aggregate Computing

3.1 On Integrating Machine Learning and Aggregate Computing

As anticipated in Sect. 2.1, the behaviour of an aggregate system depends on the interplay of three main ingredients: (i) the aggregate program, expressing conceptually the global behaviour of the entire system and, concretely, the local behaviour of each individual node in terms of local processing and data exchange with neighbours; (ii) the aggregate execution model, promoting a certain dynamics of the system in terms of topology management (e.g., via neighbour discovery), execution of rounds, and scheduling of communications; and (iii) the environment dynamics. While the latter cannot be controlled, the importance of the first element is reflected by research on the design of novel algorithms (cf. [4, 41]), while the second is studied w.r.t. the possibility of tuning and adaptivity according to the available technology and infrastructure or the dynamics of the environment. Since tuning programs or execution details to specific environments, or adapting them to changing environments, can be burdensome, it makes sense to consider the use of Machine Learning techniques to let a system learn effective strategies for unfolding collective adaptive behaviour.

3.2 Aggregate Programs Improvement Through RL

Fig. 1. Integration of RL within the AC control architecture [11]. The RL state and reward concepts build upon the context, given by environment and neighbour data. The designer configures action points where learning can improve the aggregate computation. The actions selected by the learned policies will then affect the environment (via actuators) and neighbours (via outbound messages).

In this work, we focus on improving aggregate programs by learning effective local actions within a given algorithmic schema—an approach similar to the sketching technique for program synthesis [36]. As a learning framework, we use RL as we deal with systems of agents performing ongoing activities by interacting with one another and the environment. From a local viewpoint, an AC device is an agent acting based on the local context it perceives (sensor readings, device state, and messages from neighbours), and this matches the RL execution model very well.

So, our long-term goal is to integrate RL into the AC “stack” (i.e., across the platform, language, and library levels) to improve the collective behaviour defined by ScaFi aggregate programs in terms of efficiency (i.e., reducing resource usage while maintaining the same functional interface), efficacy (i.e., synthesising more stable and faster-converging behaviours), and adaptability (i.e., the same program works in different environments).

As a first contribution, summarised in Fig. 1, in this work we integrate RL within the AC control architecture in order to support learning of good collective behaviour sketched by a given aggregate program. Specifically, we focus on improving AC building blocks (such as the gradient algorithm covered in Sect. 2.2) through learning, leading toward a so-called Reinforcement Learning-based Aggregate Computing. Learning, thus, does not replace the AC methodology for defining programs, but is best understood as a technique that supports and improves the AC algorithm design process.

3.3 Building Blocks Refinement

A major advantage of AC as a programming model is its compositionality: complex collective behaviours (e.g., the maintenance of a multi-hop connectivity channel) can be expressed through a combination of building blocks capturing simpler collective behaviours (e.g., gradients). Since building blocks are fundamental bricks of behaviour that often recur in programs, their good and bad qualities (e.g., convergence speed, stability) tend to amplify and affect the behaviours that depend on them. Therefore, research tends to investigate refined variants of building blocks that provide the same functionality but are more effective or efficient under specific assumptions or contexts (e.g., high mobility, or situations where stability is more important than precision) [3, 4]. With a library of such algorithms, the designer can choose the combination of building blocks best fitted to a given environment, and even substitute a building block with a variant implementation without substantially affecting the application logic. In general, a building block can be seen as a black box (function) that takes a set of input fields (e.g. metric, perception, or constant fields) and yields an output field. To increase its flexibility, such a function can leverage a refinement policy able to affect the behaviour of the building block over time or in certain situations; this policy could be a feedback loop, hysteresis, or custom logic to solve a specific problem, as sketched below. We aim at structuring the learning of refinement policies through RL [1]. Our idea is that the designer should not hand-code the particular refinement to be used; rather, the learning algorithm should discover, given a global objective to be optimised through a utility function, which actions need to be activated.
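In functional terms, this black-box view can be sketched as follows (illustrative types, not the ScaFi API):

```scala
// A building block as a function from input fields to an output field, whose behaviour
// can be adjusted over time by a refinement policy (a feedback loop, hysteresis,
// or a policy learned through RL).
type RefinementPolicy[Context, Action] = Context => Action

def refinedBlock[Inputs, Output, Context, Action](
    inputs: Inputs,                            // e.g. metric, perception, or constant fields
    refine: RefinementPolicy[Context, Action]  // chooses how to adjust the block
)(observe: Inputs => Context,                  // what the policy gets to observe
  step: (Inputs, Action) => Output             // one (adjusted) evaluation of the block
): Output =
  step(inputs, refine(observe(inputs)))
```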

Fig. 2. Reinforcement learning schema used in our simulations. The learning algorithm is applied at simulation time (for T episodes), improving a shared Q table. At deployment time, the agents exploit a local copy of the optimal \(Q^*\) table found by learning.

However, using Machine Learning to improve AC programs exposes several non-trivial challenges:

  • scale-free behaviours: the learned policy should work in small networks as well as in large networks as fostered by the AC abstractions;

  • continuous state space: the state typically depends on continuous values, as computational fields are often associated with continuous data (e.g. temperature or distance fields);

  • multi-agent credit assignment problem: it is not easy to adequately reward and credit local actions for their contribution to the eventual convergence to the “target” field denoting the desired, emergent, collective result.

The AC system model is based on cooperative and homogeneous behaviour. Indeed, when a node participates in an aggregate system, it has to execute an aggregate program shared by the whole system, which yields different outcomes according to the contexts in which it gets evaluated. This leads to handling homogeneous MARL, hence our goal is to find one policy for the entire ensemble, partially solving the scale-free behaviour problem, since the learned policy will not depend on the system’s size.

The continuous and variable state problems are typically tackled using Deep Learning to learn the right state representation for a given problem [28]. Here, instead, we use a handcrafted feature-engineering process, since we are more interested in devising a general design process.

Concerning the multi-agent credit assignment problem, we decided to use offline learning in a typical CTDE setting (Fig. 2). This way, we can assess the influence of an individual in comparison to the entire system, which cannot be done in online learning due to the difficulty of obtaining global snapshots of the system in a reasonable time.

Learning Schema. The learning algorithm is seen as a state (\(s_t\)) evolution function in which the nodes try to apply a correction factor following a policy (\(\pi ^Q_{target}\) or \(\pi ^Q_{behavioral}\)) refined by learning. The state is built from a neighbourhood field of the generic building-block output (\(o_t\)) passed as input. Listing 1.1 exemplifies the general program structure used to combine RL with AC for improving building blocks. The branching operator (branch) on the learning condition makes it possible to use the CTDE schema, since when learning is disabled there is no need for a central entity. The Q table is gathered through a ScaFi operator used to query sensors and collect data from them. At simulation time, Q is a shared object, but at runtime each agent owns a local table.

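A sketch of this structure in ScaFi-like syntax (an approximation of Listing 1.1; names and helper functions are illustrative):

```scala
// Sketch of the general structure (ScaFi-like syntax; all names are illustrative).
def learningBlock(o: Double, learning: Boolean): Double = {
  val q      = qTable()                // shared object at simulation time, local copy at runtime
  val state  = encodeState(nbr(o))     // state built from the neighbourhood field of o_t
  val action =
    branch(learning) {
      behaviouralPolicy(q, state)      // epsilon-greedy action; Q refined centrally (CTDE)
    } {
      targetPolicy(q, state)           // greedy action on the learned Q, fully local
    }
  correct(o, action)                   // apply the correction factor to o_t
}
```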

Finally, the produced \(o_{t}\) is returned to the caller.

In this case, we encoded the learning process with AC itself. Though we could have extracted the learning process from the AC program, we took this decision because: (i) it allows us to extend learning by exploiting neighbourhood fields of Q tables – so we can think of doing non-homogeneous learning in different areas of the system; (ii) the scheme for acquiring the state and choosing actions is the same as the one needed for learning, so the only difference is in the branch; and (iii) it can easily be extended to online learning.

3.4 Reinforcement Learning-based Gradient Block

The gradient block (cf. Sect. 2.2) could be generalised as follows:

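A sketch of this generalisation in ScaFi-like syntax (illustrative; the parameter name update is an assumption):

```scala
// Generalised gradient: how the value evolves at each non-source round is delegated to `update`.
def gradient(source: Boolean)(update: Double => Double): Double =
  rep(Double.PositiveInfinity) { g =>
    mux(source) { 0.0 } { update(g) }  // update sees the previous value and, via nbr, the neighbours
  }

// The classic gradient of Sect. 2.2 is recovered with:
//   gradient(source) { g => minHoodPlus(nbr(g) + nbrRange()) }
```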

where update is a function that determines how to evolve the gradient based on the field of current gradient values and the current estimation of distances to neighbours. In this article, we consider update as a hole that an RL algorithm will fill through raw experience, with actions aiming at incrementally constructing a gradient field, hopefully mitigating the rising-value issue (cf. Sect. 2.2). To frame our learning problem, we adopt the general schema described above (Fig. 2). The state and action functions are inspired by the CRF algorithm. The state function must hold enough information to know when agents should speed up the local rising of values. In this case, we encode the state as the difference between the output perceived by a node and the minimum and maximum gradient received by its neighbours: \(s_t = (|min_t - o_t|, |max_t - o_t|)\). These data have to be discretised; otherwise, the search space would be too big and the solution could suffer from overfitting. The discretisation is ruled by two variables, maxBound and buckets. The former constrains the values to be between \( - radius * maxBound\) and \( radius * maxBound \) (where radius is the maximum communication range of the nodes); values outside that range are mapped to the same state. The buckets variable defines how many intervals that range is divided into. Finally, we stack two time steps in order to encode history inside the agent state: \(h_t = (s_{t - 1}, s_t)\), which is used as the state for our RL algorithm. Hence, the cardinality of the state space is \(|s_t| * |s_t| = buckets^4\). The action space comprises two action types: ConsiderNeighborhood produces the classic gradient evaluation, while Ignore(velocity) ignores the neighbour data and increases the gradient at a given velocity. So, the update function is defined as:

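A sketch of the actions, the state discretisation, and the action-dependent update (illustrative names; the exact rising step and clamping are assumptions):

```scala
sealed trait Action
case object ConsiderNeighborhood extends Action           // classic gradient evaluation
final case class Ignore(velocity: Double) extends Action  // rise, disregarding neighbours

// Clamp a difference to [-radius * maxBound, radius * maxBound] and map it to one of `buckets` intervals.
def discretise(x: Double, radius: Double, maxBound: Double, buckets: Int): Int = {
  val bound   = radius * maxBound
  val clamped = math.max(-bound, math.min(bound, x))
  ((clamped + bound) / (2 * bound) * (buckets - 1)).round.toInt
}

// Action-dependent update filling the hole of the generalised gradient.
def rlUpdate(action: Action)(g: Double): Double = action match {
  case ConsiderNeighborhood => minHoodPlus(nbr(g) + nbrRange())  // classic formula
  case Ignore(velocity)     => g + velocity                      // fixed per-round increase
}
```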
Table 1. Summary of the simulation variables. A simulation is identified by a quadruple (i, j, k, l) of indexes, one for each variable.

Finally, the reward function is described as follows:

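A sketch of this reward (illustrative names; the oracle lookup and the tolerance are assumptions):

```scala
// Reward: 0 when the local output matches the ideal gradient computed by a
// simulation-time oracle, -1 otherwise.
def reward(output: Double, tolerance: Double): Double = {
  val ideal = oracleGradient(mid())  // oracle lookup, indexed by the node identifier field
  if (math.abs(output - ideal) <= tolerance) 0.0 else -1.0
}
```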

where mid() returns the field of node identifiers. The idea is to push the nodes to produce an output that is very close to the ideal, correct gradient value as provided by an oracle available at simulation time. When this is the case, we provide a reward equal to 0; when the value differs from the expected one, we provide a small negative reward, \(-1\), in order to push the nodes to quickly seek the situation where the actual and ideal values match.

4 Evaluation

To evaluate our approach, we run a set of simulated experiments and verify that an aggregate system can successfully learn an improved way to compute a gradient field (cf. the gradient block described in Sect. 2.2). To this purpose, we adopt Alchemist [35], a flexible simulator for pervasive and networked systems that comes with support for aggregate systems programmed in ScaFi [14]. The source code, data, and instructions for running the experiments are fully available in a public repositoryFootnote 1, to promote reproducibility of results.

4.1 Simulation Setup

The simulated system consists of N devices deployed in a grid. We use two kinds of grid-like environments: both have the same width (200 m) and inter-node distance (5 m), but they differ in the number of rows. In one case, there is a single row (the nodes are placed in a line); in the other, there are five rows. The total number N of agents is thus \( 200/5 * rows \): 40 agents in the first case and 200 in the second. Each node asynchronously fires its round evaluation at 1 Hz. The leftmost and rightmost agents are marked as source nodes. Each simulated episode lasts 85 s (T). To simulate the slow-rising problem, we drop the left source at 35 s (\(C_s\)), so the left part of the system starts to rise until eventually stabilising to the correct computational field. An entire simulation lasts \(N_E = 1200\) episodes: in the first \(N_L = 1000\) the system uses RL to improve a shared Q table, and in the last \(N_T = 200\) the Q table found is deployed in each agent. In these last runs, the agents act following the greedy policy.

As the learning algorithm, we use Hysteretic Q-Learning (cf. Sect. 2.3). As the behavioural policy, we use \(\epsilon \)-greedy with an exponential decay to balance exploration and exploitation. We make this choice because, without decay, the policy found tends not to be optimised (i.e. exploration is preferred over exploitation). At each episode i, the \(\epsilon \) value is updated as \(\epsilon _i = \epsilon _0 \cdot e^{-i / \theta }\).

Several variables are involved, summarised in Table 1, so we perform a grid search to find the optimal combination. To evaluate a particular configuration, we consider the total error over the last \(N_T\) episodes. This is calculated from the error of each node i at each time step t:

$$\begin{aligned} error_i^{t} = |gradient_i^{t} - simulated_i^t| \end{aligned}$$
(8)

Then for each time step t, we evaluate the average system error as:

$$\begin{aligned} error_{t} = \frac{1}{N}\sum _{i = 1}^N error_i^t \end{aligned}$$
(9)

And finally, the error of each episode is evaluated as:

$$\begin{aligned} error_{episode} = \sum _{t = 0}^T error_{t} \end{aligned}$$
(10)
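For reference, the same computation in plain Scala (illustrative; outputs(t)(i) and ideal(t)(i) denote the simulated and oracle gradient values of node i at time step t):

```scala
// Episode error (Eqs. 8-10): per-node absolute error, averaged over nodes,
// then summed over the time steps of the episode.
def episodeError(outputs: Seq[Seq[Double]], ideal: Seq[Seq[Double]]): Double =
  outputs.zip(ideal).map { case (outT, idealT) =>
    val perNode = outT.zip(idealT).map { case (o, g) => math.abs(o - g) }  // Eq. 8
    perNode.sum / perNode.size                                             // Eq. 9
  }.sum                                                                    // Eq. 10
```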

We choose the configuration by observing the box plots (Fig. 3a) and taking the lowest average error in the last \(N_T\) episodes.

Fig. 3. Performance of our RL-based gradient algorithm with = 20.

4.2 Results and Discussion

Figure 3 shows the performance of our Reinforcement Learning-based gradient algorithm. Figure 3a was used to choose the best configuration. Figure 3b shows the error trend as the episodes progress. The second row shows the trend of the mean error over the last \(N_T\) episodes; the coloured area under the curve denotes the standard deviation, and the dashed vertical line marks the time at which the source change occurs. Finally, the last row shows the average output of the various algorithms. In the following, we discuss the results reached with the best configuration, which has \(\gamma = 0.9\), \(\epsilon _0 = 0.5\), \(\theta = 200\), \(\alpha =0.3\) and \(\beta =0.05\).

Our goal was to create a solution that outperforms the classic gradient on the rising-value problem. Indeed, the system eventually learns how to speed up the gradient rising: observing Fig. 3b, the \(error_{episode}\) of the new algorithm is lower than the error produced by the standard solution. In particular, this means that the agents learn when they should ignore their neighbourhood and increase the output at a certain velocity (i.e. using the Ignore action).

This intuition is reinforced by Fig. 3c to 3f. In particular, in Fig. 3e and 3f the behaviour is more evident due to the reduced number of agents: when the error is maximised (due to the source disappearing), the error decreases faster than with the naive gradient (and, consequently, the output grows faster). Furthermore, we can also observe that the overall behaviour of the algorithm is comparable with the CRF handcrafted solution for the rising problem. Indeed, both have a speed-up phase followed by an overestimation phase (i.e. the system-wide output is greater than the true gradient field output) that eventually brings the system to slowly reach the correct value. Moreover, not only does the RL-based solution behave similarly to CRF, but it also has comparable performance, meaning that the aggregate reaches a near-optimal policy with these state-action-reward settings.

We also want to underline that a single policy is learned and shared by the whole system. Thereby, the policy can easily scale to deployments with different node counts. Indeed, the same learned function outperforms our baseline in systems with different numbers of nodes and deployment configurations.

Finally, we want to stress that the nodes do not fire synchronously, and thus there is no kind of global, shared clock: in our test deployments, the same behaviour is reached regardless of the order in which node rounds are evaluated. This, again, makes it possible to use the same policy in different scenarios, where the local evaluation order of the aggregate programs is unknown.

5 Conclusion

This paper discusses the integration of Aggregate Computing – a programming approach for Collective Adaptive Systems – and Reinforcement Learning, with the goal of fostering the design of collective adaptive behaviour. In particular, we propose to use RL as a means to improve building block AC algorithms. Our approach is applied to improving the gradient algorithm, one of the key AC algorithms, where learning is performed through Hysteretic Q-Learning. We evaluate the approach through synthetic experiments comparing the reactivity of different gradient algorithms in dealing with the rising value problem.

This work is the first effort towards Reinforcement Learning-based Aggregate Computing. Indeed, there are still many aspects that need to be analysed in detail, both at the conceptual and practical levels. First of all, the approach could be tuned to learn gradient strategies for smoothness or maximal reactivity in highly variable scenarios, and compared with state-of-the-art algorithms like BIS and ULT [4]. Secondly, the approach could be systematically applied to other building blocks as well [41]. Also very interesting would be the application of Machine Learning at the aggregate execution platform level, e.g. to tune the round frequency to reduce power consumption, to reduce the amount of data exchanged between neighbours, or to support opportunistic re-configuration of aggregate system deployments.