1 Introduction

The next generation of advanced infrastructures will be characterized by the presence of complex pervasive systems, composed of heterogeneous devices, sensors and actuators consuming and producing high volumes of interdependent data. Sensors, drones, autonomous cars, and robots are becoming smarter, cheaper, smaller and available in everyday life. They are equipped with increased memory and processing capabilities. In this context, services span wide pervasive systems involving a very large number of devices. Fog and edge-computing solutions [31] are already challenging centralized solutions by pushing some of the computation away from central servers and closer to the devices themselves. They provide lower latency and better user data privacy and security. There is still a need to accommodate dynamicity in large-scale scenarios, to adapt to arriving or departing devices, and to ensure reliability and the expected quality of services.

Coordination models [42] provide a natural solution for scaling up such scenarios. They are appealing for developing collective adaptive systems working in a decentralized manner and interacting with their local environment, since the shared tuple space on which they are based is a powerful paradigm to implement bio-inspired mechanisms (e.g., stigmergy) for collective interactions. Coordination infrastructures provide the basic mechanisms to implement and deploy collective adaptive systems. Therefore, our proposal extends a bio-inspired coordination model that ensures communication and task coordination among heterogeneous and continuously changing devices. The model provides coordination rules that autonomous entities (devices) employ to coordinate their behavior and update information related to themselves or their local environment.

Our previous work on self-composition of services, also based on a bio-inspired coordination model, exploits syntactic means only (i.e., shared keywords for input and output types) as a basis for building on-the-fly chains of services [9,10,11]. We enhanced this work with a learning-based coordination model and discussed the reliability of service composition, in a single node, in terms of results and convergence [20]. In this paper, we extend our learning-based coordination model to accommodate spontaneous self-composition of services in fully distributed scenarios using multiple nodes, and we present the use of QoS to ensure reliability. Our vision to meet these requirements consists in moving to a fully decentralized system, working as a collective adaptive system, with the following three characteristics: (1) dynamic services are composed and provided on-demand; (2) such services result from the multiple interactions of the devices involved in the production of the services, working as a decentralized collective adaptive system; (3) reinforcement learning is used to ensure optimization and reliability.

Section 2 discusses related works. Section 3 presents background information on the bio-inspired coordination model from which our work derives. Section 4 describes information handled by agents within the coordination tuple space, and how agents form requests for services and answers to such queries. Section 5 discusses self-composition of services. Section 6 presents our coordination model and its extension with reinforcement learning (RL). Section 7 discusses how to achieve optimization and reliable self-composition of services on-demand. Section 8 presents a humanitarian scenario with a practical use case. Section 9 summarizes our results. Finally, Sect. 10 concludes and outlines future work.

2 Related works

Orchestration [27] and Choreography [28] are approaches to the automated arrangement, coordination, and management of services. They depend on pre-defined composition schemas, leading to lower levels of robustness, adaptation or fault tolerance, particularly in very dynamic conditions with services arriving or departing. Popular methods, like BPEL, WS-Choreography and XPDL, are based on pre-defined workflows. The static character of these traditional composition approaches has recently been challenged by so-called dynamic service composition approaches involving semantic relations [33] and/or artificial intelligence planning techniques [41]. Early works on dynamically building or composing services at run-time include spontaneous self-composition of services [24]. One of the main challenges of these approaches is their limited scalability and the strong requirements they pose on the details of the service description. Other approaches propose to enhance composition by regrouping equivalent or semantically relevant services [17]. Early works dealing with context-aware applications and service composition rely only on connecting services' inputs and outputs to provide a spontaneously built chain of services [16]. Our work targets highly dynamic service composition without any pre-designed composition, providing compositions that are possibly more complex than chains of services. At the moment, our proposal relies on syntactic means only for composition.

Coordination models have proven useful for designing and implementing distributed systems [8]. They are particularly appealing for developing self-organizing systems, since the shared tuple space on which they are based is a powerful paradigm to implement self-organizing mechanisms, particularly those requiring indirect communication (e.g., stigmergy). Early coordination models, such as Linda [15], are designed to be deployed on one node (device). Recent coordination models, some of them inspired by nature, can be distributed across several nodes, such as TuCSoN [26] (based on Linda), TOTA [21] and Proto [3]. We extended the SAPERE [42] coordination model with self-composition [9, 10] and reinforcement learning features [20].

ASCENS [4, 39, 40] is an autonomous, self-aware approach with parallel and distributed components based on a formal language for modeling and programming systems called SCEL [12]. It aims to handle communication between millions of nodes with complex interactions and behaviors ensuring functional and non-functional properties. Our proposal does not involve formal aspects, but is ready to be deployed in IoT scenarios.

Reinforcement learning solutions are suitable for problems or search spaces modeled as a Markov decision process (MDP) [30]. Multi-agent reinforcement learning for adaptive service composition [35] has been used to achieve better performance and to select the most reliable service among similar ones; however, it relies on a pre-designed workflow. A survey of automated web service composition methods describes semi-dynamic solutions using AI planning methods [29]. However, these methods do not offer self-composition of services at run-time. Recent work on learning approaches, inspired by cognitive sciences, attempts to remove pre-defined goals and to avoid objective functions [22]. Other approaches, such as the Self-Adaptive Context-Learning (SACL) Pattern [5], involve a set of dedicated agents that collaborate to learn and map the current state of agent perceptions to actions and effects. Wang et al. [37] provide a solution similar to our proposal. This work uses a multi-agent reinforcement learning approach and QoS to dynamically compose services. It uses a dynamic service composition model called WSC-MDP to generate a transition graph. However, once the graph is created, no dynamic service composition is allowed at run-time; the focus is on selecting, in each state, the agent providing the highest QoS. Thus, the solution exploits the workflow that provides the best cumulative reward [37]. Wang et al. [37] show the limitations of the approach in large-scale scenarios and suggest using deep Q-learning to improve agent action prediction and efficiency when the number of state-action pairs becomes too high [25, 36]. Our solution aims to offer dynamic self-composition at run-time, involving multi-agent learning to discriminate the reliable solutions among the pertinent ones.

In relation to non-functional properties, evolutionary approaches such as those based on genetic algorithms (GA) have also been proposed for service composition [7], motivated by the need to determine the services participating in a composition that satisfies certain Quality of Service (QoS) constraints [2]. In this paper, we use QoS to discriminate among pertinent self-compositions and to return reliable services.

To the best of our knowledge, no approach currently combines learning, coordination model, and spontaneous self-composing (built on-demand) services.

3 Coordination model

The concept of a coordination model [8] depicts the way a set of entities and nodes interact by means of a set of rules. A coordination model consists of: the entities being coordinated; the coordination rules to which entities are subjected during communication processes; and the coordination media, which identifies the conveyed data and its treatment (Fig. 1).

Fig. 1

SAPERE coordination model: multi-agent system using eco-laws and operations that provide interaction among LSAs

Our work derives from the SAPERE model [42], a coordination model for multi-agent pervasive systems inspired by chemical reactions [14]. It is based on the following concepts:

  1. Software Agents: active software entities representing the interface between the tuple space and the external world including any sort of device (e.g., sensors, actuators), service and application.

  2. Live Semantic Annotations (LSA): Tuples of data and properties whose value can change with time (e.g., temperature value injected by a sensor is updated regularly). Section 4 discusses additional LSAs used for self-composition.

  3. Tuple space: shared space (i.e., coordination media) containing all the tuples in a node. There is one shared space per node (a node could be a Raspberry Pi, smartphone, etc.).

  4. Eco-laws: chemical-based coordination rules, namely: Bonding (for linking an agent with data, provided by another agent, that it refers to, is waiting for, or that concerns it); Decay or Evaporation (regularly decreasing the pertinence of data and ultimately removing outdated data); Spreading (for propagating LSAs to neighboring nodes).

  5. Operations: list of operations that are executed by the system like Inject (for inserting a new LSA), Update (for updating an LSA's fields) or Remove (for removing an LSA).

4 Live semantic annotation

In this paper, we extend the notion of LSA as defined in the original SAPERE model [42], in order to accommodate the notion of services and their self-composition. Each agent, acting as a wrapper for a service, is represented by one LSA.

An LSA is defined by two lists: (1) a set of Service properties, or property names, to which the agent wants to be alerted (through bonding) when they become available, which we note S. They usually refer to services or outputs provided by other agents; (2) a set of Properties that an agent provides (i.e., they correspond to the service provided by this agent, as an output, or to information about itself or the environment), which we note P.

$$\begin{aligned} LSA {:}{:}{=} \{ S = [ svc_1, \ldots , svc_m ],\; P = [ P_1, \ldots , P_n ] \} . \end{aligned}$$

Let us note \({{\mathcal {P}}}\) a set of property names. Service properties \(svc_j \in {{\mathcal {P}}}\) are property names to which the agent wants to bond. The agent wants to be alerted as soon as values corresponding to service \(svc_j\) are injected into the LSA space.

Fig. 2

Service production and consumption: Agent_1 provides temperature property, while Agent_2 consumes temperature property

\(P_i\) are properties that an agent provides to the system. Properties in P are of the form \(P_i = <key_i, value_i>\), where \(key_i \in {{\mathcal {P}}}\) are property names, and property values \(value_i\) can take the following forms:

  • \(\emptyset \): a value can be temporarily empty, due, for instance, to the Evaporation eco-law that removes the value. This can be the case for a temperature sensor whose value is no longer valid after a certain time.

  • \(\{v\}\): a single value representing the value that the agent inserts in the coordination space as the service it provides. For instance, an agent working on behalf of a temperature sensor provides the current value of the temperature.

  • \(\{v_{i1}, \ldots , v_{in}\}\): a vector that contains a list of values such as GPS coordinates.

  • a matrix that contains many lists of values such as multi-dimensional coordinates.
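
To make the notation above concrete, the following minimal Python sketch encodes an LSA and the examples of Figs. 2 and 3. It is purely illustrative: the class and field names are our own assumptions, not the SAPERE API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class LSA:
    S: List[str] = field(default_factory=list)       # requested Service property names
    P: Dict[str, Any] = field(default_factory=dict)   # provided Properties: key -> value

# Service production only (Agent_1 in Fig. 2): a temperature sensor.
producer = LSA(P={"Temperature": 20})

# Service consumption only (Agent_2 in Fig. 2): waits for a Temperature value.
consumer = LSA(S=["Temperature"])

# Service request with input provided (Agent_2 in Fig. 3).
request = LSA(S=["Weather"], P={"CityName": "Geneva"})

# A property value may also be empty (after Evaporation), a vector or a matrix.
gps_producer = LSA(P={"GPS": [46.2, 6.15]})  # illustrative coordinate vector
```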

Depending on the properties that compose the LSA and when the agent injects or updates the LSA, we distinguish the following cases:

  • Service production only:

    LSA = \(\{ P = [ P_1 \ldots , P_n ] \}\). The LSA contains Properties only. In this case, the agent provides a service, which consists in regularly updating the values corresponding to the set of properties. The agent does not require further service, interaction or information with/from other agents. This is, for instance, the case of Agent_1 (Fig. 2), working on behalf of a temperature sensor that regularly updates the current temperature value, and does not require any additional information from other agents.

    Example : LSA = \(\{ P = [ <Temperature: 20> ] \}\).

  • Service consumption only:

    LSA = \(\{ S = [ svc_1, \ldots , svc_m ] \}\). The LSA contains Service properties only. In this case, the agent wishes to be alerted as soon as one or more values corresponding to the specified service properties are injected in the tuple space. The agent may or not provide in return a service of its own. It simply consumes the values provided by other agents. It retrieves these values when bonding with the specified service properties. This is, for instance, the case of Agent_2 (Fig. 2) wishing to be informed as soon as a new Temperature value is injected in the system. In this example, Agent_2 waits for a temperature to be provided as a property in the tuple space. Agent_2 LSA bonds with Agent_1 LSA as soon as the latter updates the temperature value. Agent_2 then consumes the temperature value. This usually prompts an internal processing by Agent_2 that may result in various actions, such as controlling an actuator.

    Example : LSA = \(\{ S = [ Temperature ] \}\).

  • Service announcement with input required:

    LSA = \( \{ S = [ svc_1, \ldots , svc_m ], P = [ ] \}\). A service announcement consists in injecting an LSA with a set of requested Service property names \(svc_1, \ldots , svc_m\), and an empty set of provided Properties. The Service property names \(svc_1, \ldots , svc_m\) indicate the inputs expected by the agent. This is the case of Agent_1 (Fig. 3 (1)) wishing to be informed as soon as a CityName value is injected in the system. Agent_1 is a service that provides an output based on a city name, as, for instance, a weather service.

    Example : LSA = \(\{ S = [ CityName ], P=[] \}\).

  • Service request with input provided:

    LSA = \( \{ S = [ svc_1, \ldots , svc_m ], P = [ P_1, \ldots , P_n ] \}\). A service request consists in injecting an LSA with a set of requested Service property names \(svc_1, \ldots , svc_m\), and a set of provided Properties \(P_1, \ldots , P_n \). The P list provides the names and values of inputs to the requested service, while the S list mentions the Service property names expected as output. For instance, Agent_2 (Fig. 3 (1)) requests the weather forecast for a given city. It mentions in the S list the Service property name Weather, and in the P list the city name under the form \(<CityName: Geneva>\). The latter serves as input for other agents, like Agent_1, waiting for a city name value. Agent_1 LSA bonds with Agent_2 LSA as soon as Agent_2's LSA becomes available in the tuple space (Fig. 3 (1)).

    Example :

    LSA=\(\{ S = [ Weather ], P=[<CityName: Geneva>] \}\).

  • Service answer with input and output:

    LSA = \( \{ S = [ svc_1, \ldots , svc_m ], P = [ P_1, \ldots , P_n ] \}\). A service answer consists in updating the P part of an LSA. The P list provides the names and values of the output of the requested service, while the S list mentions the Service property names provided as input. The agent consumes, as input, the values corresponding to the service properties list \(svc_1, \ldots , svc_m\), and provides as output, as the result of an internal processing, the values corresponding to the provided property names: \(<key_1: v_1>, \ldots , <key_n: v_n>\). After bonding with the service request, Agent_1 (Fig. 3 (2)) provides through an internal computation the weather forecast for the city name. It actually updates its previous LSA with the Weather value for the provided CityName. Agent_2 LSA, requesting a Weather property, can then bond with the LSA just updated by Agent_1. Agent_2 then retrieves the value, Sunny, for the Weather property.

    Example :

    LSA=\(\{ S = [ CityName ], P=[<Weather: Sunny>] \}\).

Fig. 3

Service announcement, request and answer: Agent_1 consumes CityName property to provide weather forecast, whereas Agent_2 requests Weather property by providing a CityName property

When an agent provides a service without waiting for specific input, the above Service production only and Service consumption only cases apply. It is important to note also that any agent waiting for a property called CityName will be sensitive to the LSA injected by Agent_2 (Fig. 3) above, and may provide its own output, not necessarily in line with what is expected by the query agent. It is also important to highlight that the set of properties P serves either to inject input values, prompting agents to act and provide a service, or to inject output results produced by the agent's computation.

5 Spontaneous service self-composition

A spontaneous service built on-demand and provided as the self-composition of various other services results from the collective interactions of a series of agents, each providing a portion of the final requested service. It arises from the self-composition of the diverse services provided by the agents at run-time. A Service request with input (see Sect. 4) is first injected in the tuple space. The query is then analyzed by the agents that are sensitive to the input properties. Coordination of the different agents, as well as the production of service compositions on-the-fly, occurs through this indirect retrieval and injection of properties in the shared tuple space (some agents waiting for properties provided by other agents to start, continue or finish their work). Such models are efficient in dynamic open systems (such as pervasive scenarios), where agents can communicate asynchronously without having global knowledge about the system and its participants. Agents can join or leave the system at any moment. They are not known in advance, and neither are the services they provide. This loose coupling among the agents is key for on-demand service composition.

In this section, we are concerned with on-demand service composition: (1) how the diverse agents collectively interact and each provide a part of the requested service; (2) how the whole service arises from the self-composition; and (3) how the whole system produces results. We also discuss how the whole service unfolds, first on a single node, then on several distributed nodes.

5.1 Property

In Sect. 4, we defined a Property \(P_i = <key_i, v_i>\) as a pair of (key, value). The examples of the previous section are concerned with bonding with services' inputs and outputs only. If we consider the city weather example above, we observe that the weather forecast agent should accommodate simultaneously different requests for weather and for different cities. We must then provide a way to furnish individual answers to each requesting agent. Additionally, due to the loose coupling discussed above, several compositions of services may arise from a single query. We also need to discriminate the various composition schemas, in order to later learn the best composition. For this reason, in this section, we extend the definition of Property \(P_i \in P\), in order to allow simultaneous queries and to identify the requesting agent at the source of the query. We thus extend Property \(P_i\) as follows:

$$\begin{aligned} P_i {:}{:}{=}&\{<key_i: v_i>,<\#B_i:BondedAgent>, \\&<\#Q_i:QueryAgent>, <\#C_i:Schema>, \\&\#True/False \} \end{aligned}$$

where:

  • \(key_i\) : the property name

  • \(v_i\) : the value of property \(key_i\)

  • \(\#B_i\): the id of the agent that provided the LSA to which property \(key_i\) bonded

  • \(\#Q_i\): the id of the agent that is at the origin of the request

  • \(\#C_i\): a sequence of property names representing the composition schema. The requested property names are separated by a vertical line “|” from the provided properties during composition.

  • \(\#True/False\) : a flag that indicates whether a property \(key_i\) has been consumed (#True) or not (#False) by other agents.

A composition schema is a concatenation of property names, corresponding to the unfolding of the service composition. We say that a composition schema is partial when the input property is present but the output property is not yet reached. We say that a composition schema is final when it starts with the input property name and ends with the output property name.
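
Continuing the illustrative Python sketch above, the extended Property and the partial/final distinction could be encoded as follows; the field and helper names are assumptions, not the platform's actual code.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Property:
    key: str                     # property name
    value: Any                   # property value
    bonded_agent: Optional[str]  # #B: agent whose LSA this property bonded with
    query_agent: str             # #Q: agent at the origin of the request
    schema: str                  # #C: e.g. "C|A,B" (requested name | provided so far)
    consumed: bool = False       # #True/#False flag

def requested_output(schema: str) -> str:
    return schema.split("|", 1)[0]

def provided_so_far(schema: str) -> list:
    return schema.split("|", 1)[1].split(",") if "|" in schema else []

def is_final(schema: str) -> bool:
    # final when the provided chain reaches the requested output name
    provided = provided_so_far(schema)
    return bool(provided) and provided[-1] == requested_output(schema)

p = Property("A", "a", None, "Agent1", "C|A")
assert not is_final(p.schema)   # partial: C not reached yet
assert not is_final("C|A,B")    # still partial
assert is_final("C|A,B,C")      # final schema
```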

Fig. 4

Self-composition within one node: Agent_1 requests a C property and provides an A property, Agent_2 consumes an A property and provides a B property, and Agent_3 consumes a B property and returns a C property

5.2 Self-composition within one node

Figure 4 shows the case of a sequence of service compositions (chain of services) taking place inside the same computational node.

Step 1: Agent_1 injects the original service request: It provides an input value “a” for a Property name “A,” and waits for an output value for a Service property name “C.” This LSA has the form of a Service request with input provided (see Sect. 4). At this point, #B is empty as the LSA did not bond with any other LSA yet; #Q:Agent1 indicates the id of Agent_1 as being at the origin of the request; and the schema composition is of the form #C:C|A.

Step 2: Agent_2 bonds with the Agent_1’s LSA as it is sensitive to the “A” property name. It updates its LSA by adding a new value “b” for Property “B”; it indicates the bond with Agent_1 LSA by updating #B:Agent1; it indicates the id of the agent at the origin of the request #Q:Agent1, and provides a schema #C:C|A,B, since Agent_2 adds its own contribution to the composition by providing an intermediary value for Property “B.”

Step 3: At this point, Agent_3 bonds with Agent_2 LSA and provides a new value “c” for Property “C”; indicates the bond with Agent_2 by updating #B:Agent2; indicates the id of the agent at the origin of the request #Q:Agent1, and completes the schema #C:C|A,B,C, since Agent_3 adds its own contribution to the composition by providing a value for Property “C” from the intermediary value provided for “B,” thus completing the schema.

Step 4: Agent_1 then bonds with Agent_3’s LSA and retrieves the expected value “c” for Service property name “C” that it was waiting for. Not represented in the figure, Agent_1 then updates #B:Agent3, indicating the bond with Agent_3. It also retrieves from Agent_3 the complete schema.

In this example, we observe a chain of service composition providing the value expected by Agent_1. Neither Agent_2 nor Agent_3 can provide the service on its own. Combined in a chain, Agent_2 first provides an intermediary value “b,” and then Agent_3 provides the expected value “c” from the intermediary one.

Regarding the consumed flag (#True/#False), it is initially set by default to #False in each LSA, and progressively updated to #True once the corresponding property is consumed by another agent.
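
The chain of Fig. 4 can be summarized with the following hedged sketch of what a reacting agent adds to its LSA. Properties are plain dictionaries here, and the react helper is hypothetical, kept only to show how #B, #Q, #C and the consumed flag evolve at each step.

```python
def react(bonded_owner, bonded, out_key, out_value):
    """Property added by an agent that reacts to a bonded property."""
    bonded["consumed"] = True                  # the consumed input flips to #True
    return {
        "key": out_key, "value": out_value,
        "#B": bonded_owner,                    # id of the agent whose LSA it bonded with
        "#Q": bonded["#Q"],                    # original query agent, propagated unchanged
        "#C": bonded["#C"] + "," + out_key,    # schema grows: "C|A" -> "C|A,B"
        "consumed": False,
    }

# Step 1: Agent_1 injects the request (provides A = "a", requests C).
p1 = {"key": "A", "value": "a", "#B": None, "#Q": "Agent1", "#C": "C|A", "consumed": False}
# Step 2: Agent_2 reacts, providing B = "b".
p2 = react("Agent1", p1, "B", "b")
# Step 3: Agent_3 reacts, providing C = "c" and completing the schema.
p3 = react("Agent2", p2, "C", "c")
print(p3["#C"], p3["#Q"])  # -> C|A,B,C Agent1
```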

5.3 Multiple queries

Figure 5 shows the case of multiple service requests handled by the same agent.

Step 1: Agent_1 and Agent_4 both inject a similar LSA for a Service request with input provided (see Sect. 4). Agent_1 wishes to retrieve a value for Service property name “C,” providing as input the value “a” for Property “A.” Agent_4 does the same, but provides a different input value “a'.” In both cases, schemas are of the form #C:C|A.

Step 2: Agent_2 is sensitive to both queries and updates its LSA with corresponding values “b” for the query of Agent_1 and “b’ ” for the query of Agent_4. The rest of the property is updated accordingly in order to keep track of the bonding (Agent_1 and Agent_4, respectively), the original agent that made the query (again here Agent_1 and Agent_4, respectively), and the corresponding schemas #C:C|A,B.

Step 3: Agent_3 is sensitive to Agent_2 LSA, and updates its own LSA with value property “c” corresponding to input “b” and with value property “c’ ” for input “b’.” The rest of the property is updated accordingly in order to keep track of the bonding agent (in both cases Agent_2), the original agent that made the query (Agent_1 and Agent_4, respectively), and the corresponding schemas #C:C|A,B,C.

Step 4: Agent_1 and Agent_4 bond with Agent_3 LSA, each retrieving its own result, i.e., the one tagged by the query field #Q indicating itself as the original query agent. The composition for Agent_1 is depicted with solid arrows, the one for Agent_4 with dashed arrows.

In this example, we see that each property of the property list corresponds to a query. Thus, each query agent (Agent_1 and Agent_4) bonds only with the properties related to its own query.
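 
A small sketch of this filtering follows: the query agent keeps only the answers whose #Q field names itself (values taken from Fig. 5; the helper and data layout are hypothetical).

```python
def own_results(query_agent, wanted_key, properties):
    """Keep only the properties that answer this agent's own query (#Q match)."""
    return [p for p in properties
            if p["#Q"] == query_agent and p["key"] == wanted_key]

agent3_P = [
    {"key": "C", "value": "c",  "#Q": "Agent1", "#C": "C|A,B,C"},
    {"key": "C", "value": "c'", "#Q": "Agent4", "#C": "C|A,B,C"},
]
assert own_results("Agent1", "C", agent3_P)[0]["value"] == "c"
assert own_results("Agent4", "C", agent3_P)[0]["value"] == "c'"
```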

Fig. 5

Multiple queries at run-time: Agent_1 and Agent_4 provide an A property and request a C property. Agent_2 consumes an A property and provides a B property, whereas Agent_3 consumes a B property and provides a C property

5.4 Self-composition across multiple nodes

Figure 6 shows the case of a service composition arising across several nodes. Indeed, agents can sit at different nodes of a distributed network, and nodes can be connected in various ways, e.g., as a peer-to-peer network.

Fig. 6

Self-composition across multiple nodes at run-time: Node_1 hosts Agent_1 (as a query agent) and Agent_2 that provides a B property and consumes an A property. Node_2 hosts Agent_3 that provides a B property and consumes an A property. Node_3 hosts Agent_4 that provides a C property and consumes a B property

Step 1: Agent_1 injects a service request LSA. Agent_1 wishes to retrieve a value for Service property name “C,” providing as input the value “a” for Property “A.”

Step 2: A copy of this LSA automatically spreads to all nodes in the network (not shown in Fig. 6).

Step 3: Agent_2, located in the same node as Agent_1, bonds with Agent_1 LSA present in Node_1. Agent_3, in Node_2, bonds with the copy of Agent_1 LSA that reached Node_2. In both cases, Agent_2 and Agent_3 update their respective LSAs as discussed in the previous examples. Agent_2 and Agent_3 both provide the value “b” of type “B.” They update the bonding and original query field, in both cases with Agent_1.

Step 4: A copy of the updated LSAs of Agent_2 and Agent_3 spreads to all nodes.

Step 5: Agent_4 located at Node_3 bonds with both of them. It updates its own LSA, adding two new properties “C” as it bonds with both Agent_2 and Agent_3.

Step 6: As before, a copy of Agent_4 LSA spreads to all nodes.

Step 7: Finally, Agent_1 LSA, still waiting at Node_1, bonds with Agent_4 LSA. The result of the composition is provided to the query agent that originally injected the query. In this case, Agent_1 retrieves the value “c” resulting from the composition involving Agent_2, since the tag #True corresponds to the composition provided through Agent_2. The result of the composition involving Agent_3 is not consumed. Even though the value is the same, the quality of service may vary, and thus one of the compositions may be favored against the other.

Once an agent starts bonding, its LSA is updated (e.g., with a new property), and a copy automatically spreads along the network. Other LSAs present in the different tuple spaces will react to it. If they are sensitive to the provided properties (i.e., able to bond with them), they will update their respective LSAs, which in turn automatically spread along the network. All copies are dynamically removed after a short time by the Decay eco-law.
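
The sketch below illustrates, under simplifying assumptions of our own (a TTL-based copy removal and a toy node model), how Spreading and Decay could interact; it is not the SAPERE implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    tuple_space: List[dict] = field(default_factory=list)
    neighbors: List["Node"] = field(default_factory=list)

def spread(origin: Node, lsa: dict, ttl: int = 3) -> None:
    """Spreading eco-law: push a copy of the LSA to every neighboring node."""
    for node in origin.neighbors:
        node.tuple_space.append({**lsa, "copy": True, "ttl": ttl})

def decay(node: Node) -> None:
    """Decay eco-law: age the copies and drop the outdated ones."""
    for lsa in node.tuple_space:
        if lsa.get("copy"):
            lsa["ttl"] -= 1
    node.tuple_space = [l for l in node.tuple_space if l.get("ttl", 1) > 0]

n1, n2 = Node("Node_1"), Node("Node_2")
n1.neighbors = [n2]
spread(n1, {"S": ["C"], "P": {"A": "a"}})   # Step 2 of Fig. 6: the query reaches Node_2
for _ in range(3):
    decay(n2)                               # the copy eventually disappears
assert n2.tuple_space == []
```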

6 Learning-based coordination model

Our coordination model derives from the SAPERE coordination model [42]. We enhanced it by regulating the Bonding eco-law through a multi-agent reinforcement learning module. Reinforcement learning (RL) algorithms are machine learning algorithms for optimization and decision making under uncertainty in sequential decision problems. The problems solved by RL are modeled, among other formalisms, as a Markov decision process (MDP) [32].

6.1 Multi-agent reinforcement learning

Multi-Agent Reinforcement Learning (MARL) [34] is an extension of the RL framework in which multiple agents (in contrast to the single agent of the standard RL framework) work in a fully cooperative, fully competitive, or mixed manner [6]. Multi-agent learning solutions are appealing since they help adapting to complex and dynamically changing environments. This is particularly true for concurrent multi-agent learning, where a given problem or search space is subdivided into smaller problems assigned to different agents [13]. An important aspect of the RL learning process is the exploration versus exploitation trade-off. “Exploiting” (i.e., operating with the current best choice) allows capitalizing on the current optimal action, while “exploring” can discover new actions that may outperform the best choice so far [32]. The \(\epsilon \)-greedy and the Boltzmann exploration are popular exploration algorithms that consider those two aspects. For a more detailed survey of RL techniques and exploration algorithms, the reader can refer to [18].

In our self-composition of services problem, each agent has to maximize its own gain to yield the most suitable results. Therefore, agents are neither cooperative nor competitive. They work in a reactive fashion, responding to any input provided in the tuple space. Indeed, in a non-stationary problem such as the one tackled here, convergence is not guaranteed, since an agent's reward also depends on the actions of the other agents.

As we have seen in Sect. 5, coordination of the different SAPERE agents occurs through an indirect retrieval and injection of LSAs updating property values in the shared tuple space. Each node provides a few simple services and data regarding its environment. Many different results might occur through self-composition. Therefore, we enriched the original coordination platform with a reinforcement learning (RL) module for each agent, and a Reward operation within the coordination platform itself, as shown in Fig. 7. The RL module allows agents participating in the compositions to progressively learn to actually participate or to refrain from participating in a composition. Agents working on behalf of end users provide a corresponding positive or negative feedback through the Reward operation. Rewards are dynamically propagated to all agents participating in the composition and are delivered to the agents in an event-driven manner. A positive feedback rewards a pertinent composition with acceptable QoS (a reliable one), while a negative feedback sanctions non-pertinent compositions or those with an unacceptable QoS.
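
As an illustration of this Reward operation, the hedged sketch below forwards a user's feedback, as an event, to every agent recorded as a contributor of the selected composition. The class and method names are hypothetical and only stand in for the platform's internal bookkeeping.

```python
from typing import List

class Agent:
    def __init__(self, name: str):
        self.name = name
        self.received: List[int] = []        # stands in for the RL module's reward input

    def receive_reward(self, schema: str, reward: int) -> None:
        self.received.append(reward)          # the RL module would update Q(schema, React)

def propagate_reward(contributors: List[Agent], schema: str, reward: int) -> None:
    """Forward the user's feedback, as an event, to every contributing agent."""
    for agent in contributors:
        agent.receive_reward(schema, reward)

a2, a3 = Agent("Agent_2"), Agent("Agent_3")
propagate_reward([a2, a3], "C|A,B,C", +10)    # positive feedback from the end user
assert a2.received == [10] and a3.received == [10]
```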

Fig. 7

Learning-based coordination model: extension of SAPERE coordination model with a reinforcement learning (RL) module in each agent and a reward operation to forward users’ feedback

6.2 QLearning

QLearning [38] is a reinforcement learning algorithm that allows agents to learn a policy that maximizes rewards by taking actions when being in a given state. Our problem can be modeled by a Markov decision process (MDP) as we have finite sets of actions and states. Each agent has to choose from a set of actions depending on its current state when a bonding is triggered. Our model is formed by:

  • States S: the set of all possible composition schemas. The state is updated in the #C composition schema attribute of the property.

  • Actions : \(A=\{Ignore, React\}\);

    • Ignore the bonded LSA : This action avoids further useless bonding.

    • React to the bonded LSA and update its own LSA properties by adding properties and information to the composition schema.

  • Reward : All agents that participated in a successful composition (with final composition schema) are rewarded positively or negatively by the user depending on the actual relevance of the result for that user.

  • Exploration algorithm : \(\epsilon \)-greedy [18]. This algorithm has a probability \(\epsilon \) of selecting a random action and a probability \(1-\epsilon \) of selecting the action that maximizes the current approximation of Q(s, a) (see below).

  • Q function : \(Q:S\times A \rightarrow {\mathbb {R}}\), where:

    \(Q^i_{t+1}(s_{t},a_{t})=Q^i_t(s_{t},a_{t})+\alpha \times ( R^i_t + \gamma \times max_a(Q^i_t(s_{t+1},a)) - Q_t^i(s_{t},a_{t}))\),

    \(\forall i \in \{1,..,n\}\), where n is the number of agents that participated in the service self-composition, t is the current time, \(s_t\) is the state at time t in which the agent took action \(a_t\), \(s_{t+1}\) is the next state reached by the agent after taking action \(a_t\), and \(\gamma \) is the discount factor applied to the cumulative future reward. For the rest of this paper, except where specified, we fix \(\gamma \) to 0.9 in order to be in line with [18]; \(\epsilon = 0.2\), allowing for 20% exploration by the agents; and \(\alpha = 0.3\), so that the agents are moderately sensitive to feedback. See Sect. 7.3.1 for more details on \(\alpha \).

The collective interactions among the agents produce all possible compositions. The goal of an agent is to maximize the reward it gets from the environment by learning which action leads to the optimal reward. Afterward, the agents, equipped with the RL module, use their learned policy to take the appropriate actions. If the agent decides to react, given its actual state, it injects a new property and spreads a copy of its LSA to all nodes in the network. Otherwise, if it decides to ignore the Bonding eco-law, it does not update its LSA. The process continues until the production of a possible result. The latter is achieved once a final composition schema is produced. Once a composition is completed, the user receives a result and an evaluation process is launched. A user is a human being, or another system, on behalf of which an agent works, and that is able to provide feedback regarding the provided result. A backward reward is attributed to all agents that participated in the service composition. Agents then use the reward information R to update their Q function. Pertinent and efficient compositions thus tend to receive better rewards. Agents that participated in at least one selected composition schema are rewarded positively, while those that reacted and contributed only to non-selected schemas receive no specific reward. Similarly, agents that reacted but did not participate in a final schema do not receive any reward. A gradient reward might help avoid long schemas, as later partial composition schemas would be rewarded less. However, sparse rewards [19] are known to slow down learning, so a continuous reward function could be an alternative. Finally, in RL, reward functions are tricky to choose and depend on the problem. In this paper, we chose to use a fixed reward value \(R = +10\) or \(R = -10\).

Each agent maintains a Q matrix made of several lines, one line per state, i.e., per composition schema. Each line has three columns: the state, which corresponds to the actual partial composition schema, and the two possible Actions (Ignore, React). We initialize each Q matrix with 0 for Ignore and 5 for React, to favor reactions at the beginning of the process or for newly arriving agents. The values of the Actions are updated with the Q function of Sect. 6.2 and any received reward R. By looking at the values of this matrix, the agent is then able to decide, for each partial composition, the appropriate action to take.
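
A minimal sketch of such an RL module follows, using the parameters and the 0/5 initialization given in the text. The class layout and the handling of the terminal case (no next state) are our own assumptions, not the actual implementation.

```python
import random

IGNORE, REACT = "Ignore", "React"

class RLModule:
    def __init__(self, alpha=0.3, gamma=0.9, epsilon=0.2):
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = {}                                # state (composition schema) -> {action: value}

    def _line(self, state):
        # one line per schema, initialized to 0 (Ignore) and 5 (React)
        return self.Q.setdefault(state, {IGNORE: 0.0, REACT: 5.0})

    def choose(self, state):
        line = self._line(state)
        if random.random() < self.epsilon:         # explore (epsilon-greedy)
            return random.choice([IGNORE, REACT])
        return max(line, key=line.get)             # exploit the best-valued action

    def update(self, state, action, reward, next_state=None):
        # Q-learning update; next_state=None is treated as terminal (an assumption)
        line = self._line(state)
        future = max(self._line(next_state).values()) if next_state else 0.0
        line[action] += self.alpha * (reward + self.gamma * future - line[action])

rl = RLModule()
state = "Weather|CityName"                         # schema of the example in Fig. 8
action = rl.choose(state)                          # initially biased toward React
rl.update(state, REACT, reward=+10)                # positive user feedback
print(rl.Q[state])                                 # React rises from 5.0 to 6.5
```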

6.2.1 Feedback

Sometimes, agents present in the system do not get any feedback, either because their results are not what the end user expects or because the inputs they provide are not consumed by any other agent. Indeed, some of these compositions may never reach a conclusion (the expected output value is never provided, because no such service exists in the system, and the composition schema remains partial). In other cases, agents decide to ignore (not react to) bonding properties. We distinguish the following cases (summarized in the sketch after the list):

  • An agent takes the React action, but no feedback is returned by an end user: either the agent participates in a partial composition, or the user has not yet evaluated that particular composition. In both cases, the agent does not receive any reward and its matrix is not updated.

  • An agent takes the React action, it receives a positive or a negative feedback from the end user: The agent updates its Q matrix accordingly.

  • An agent decides to take the Ignore action (i.e., it decides not to react even though a bonding happened): this agent receives an internal negative feedback from the RL module. As said above, we initialize the Q matrix so that React is favored at the beginning, giving the agent the possibility to receive a feedback (negative or positive). Still, there may be cases where the agent systematically ends up in partial compositions and receives no feedback. Due to \(\epsilon \), once in a while, the Ignore action is selected. We decided to provide an internal negative feedback in this case. Indeed, agents that never receive feedback (because of partial compositions) when they react progressively learn to actually ignore the bonding and stop reacting.
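
The three cases can be summarized by the following hedged sketch. The magnitude of the internal negative feedback is an assumption, as the text does not specify it.

```python
IGNORE, REACT = "Ignore", "React"
INTERNAL_PENALTY = -10   # assumed magnitude; the text only says "internal negative feedback"

def handle_outcome(update, state, action, user_feedback):
    """update(state, action, reward) stands for the agent's Q update
    (e.g., RLModule.update from the previous sketch)."""
    if action == REACT:
        if user_feedback is not None:            # case 2: the user evaluated the composition
            update(state, REACT, user_feedback)
        # case 1: no feedback (partial or not yet evaluated): no update at all
    else:                                        # case 3: the agent ignored the bonding
        update(state, IGNORE, INTERNAL_PENALTY)  # internal negative feedback

# Toy Q table, updated with alpha = 0.3 and no future term (terminal assumption):
Q = {("Weather|CityName", REACT): 5.0, ("Weather|CityName", IGNORE): 0.0}
update = lambda s, a, r: Q.__setitem__((s, a), Q[(s, a)] + 0.3 * (r - Q[(s, a)]))
handle_outcome(update, "Weather|CityName", REACT, user_feedback=+10)
print(Q[("Weather|CityName", REACT)])            # 6.5
```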

7 Self-composition and learning in action

Our system needs to learn to respond appropriately to the Bonding eco-law by optimizing the overall system behavior in order to provide the most relevant and reliable self-composed services. The collective interactions among the agents produce all possible compositions, including inconclusive or useless ones for the end user. Through learning, agents progressively update their behaviors by following what they have learned from users' feedback. The collective adaptive system then converges toward correct and efficient compositions, i.e., the ones actually expected by the user.

In this section, we discuss different cases: (1) single agent’s learning; (2) multiple agents’ learning; (3) adaptation to dynamic changes; and (4) integration of QoS for the selection of services.

7.1 One agent’s learning

We present in Fig. 3 a simple composition that provides a weather property after injecting a city name. We revisit this example, including the learning aspect (Fig. 8).

Fig. 8

Weather service consuming CityName property as input and providing Weather property as output

Query: As explained in Sect. 4, Agent_2 injected Geneva as a CityName value and requested a Weather property. Here, Agent_2 stands in for the end user of our system.

Self-composition: Agent_1 used the Bonding eco-law to consume the CityName value, as it is sensitive to this property, and injected a new Weather property. Agent_2 bonded with the new injected property and got its value.

Reward: The end user expresses his satisfaction by evaluating the returned value through a positive or a negative feedback. Agent_2 uses the Reward operation provided by the platform to propagate the user's feedback to the previous agent. The #B tag in the property marks the agents that bonded for that particular composition. In such a way, the internal engine of our coordination platform keeps track of the agents that contributed to the composition and sends them the reward under the form of an event. The #Q tag references the agent that injected the query, and #C stands for the composition schema. Upon receiving the feedback, Agent_1 updates its Q matrix accordingly. The agent's next action will be based on the learned policy. Naturally, a few feedback are needed to obtain a policy in line with what the end user expects.

Agent: As mentioned above, we fixed the learning rate \(\alpha =0.3\), \(\epsilon =0.2\) for random exploration purposes, and \(\gamma =0.9\). The Q matrix is initialized with 0 for the Ignore action and with 5 for the React action, so that agents tend to react in the beginning. The Q matrix of Agent_1 contains one line for State \(Weather\mid CityName\). Table 1 shows how this line is progressively updated after each received positive feedback. Indeed, the reward R is either \(+10\) for a positive feedback or \(-10\) for a negative one. In this particular example, Agent_1 receives a series of positive feedback from Agent_2. We see, at each positive feedback (time 0, 1, etc.), how the Q-value for the Ignore action decreases according to the Q function of Sect. 6.2, while the Q-value for the React action increases.

Table 1 Agent_1 Q matrix update—positive feedback only

7.1.1 Learning curves for a single agent

Figure 9 shows how the agent’s learning curve, for each action, evolves after receiving positive user feedback only. We see how the Ignore, respectively, React, Q values evolve with time. Progressively, the agent learns to systematically React to the composition schema.

Fig. 9

Learning curve when only receiving positive feedback for the same query

Figure 10 shows how the agent’s learning curve evolves after receiving negative user feedback only. As we are employing the \(\epsilon \)-greedy algorithm, the agent still takes a random action every few queries (in 20% of the cases). As shown at the fifth iteration, the agent reacts, although the Ignore action value is higher than the React action value in the Q matrix. The agent receives a negative feedback, and the Q matrix is updated negatively for the Ignore action. We then observe regularly these decreasing plateaux, corresponding to the \(\epsilon \)-greedy random action, probing a React action, but still receiving a negative feedback.

Fig. 10

Learning curve when only receiving negative feedback for the same query

Figure 11 shows how an agent updates its knowledge when the user feedback changes after a few iterations. At the beginning, the agent learns to react after receiving a few positive feedback. At iteration number 15, the agent receives a negative feedback from the end user. After three such negative feedback, the Ignore action value exceeds the React action value in the Q matrix. As before, the \(\epsilon \)-greedy pushes the agent to explore the React action, but still it receives a negative feedback that further decreases the corresponding value.

Fig. 11

Learning curve when receiving positive feedback until the 15th iteration and negative feedback afterward for the same query

Fig. 12

Learning curve when receiving negative feedback until the 15th iteration and positive feedback afterward for the same query

Similar to the previous example, Fig. 12 shows how an agent updates its policy when it receives positive feedback after having been rewarded negatively. At the beginning, the agent learns to ignore the Bonding eco-law for the corresponding query. The Ignore value goes up, while the React value goes down in plateaux (because of the \(\epsilon \)-greedy behavior discussed above). At time 15, the agent starts receiving positive feedback from a probe on the React action. We then see the React value going up and the Ignore one going down. Since the Ignore value is still higher than the React one, the agent still tends not to react, until at some point the React value exceeds the Ignore one, at which point the learning curves become similar to the ones in Fig. 9.

7.2 Multiple agents’ learning

Figure 4 presents a multiple agent composition involving generic services. In this section, we add a new agent to that example. Agent_4 hosts a service defined by: \(LSA = \{S=[A], P=[C]\}\).

Fig. 13

Multiple agent composition involving generic services: Agent_1 receives an answer to its query and sends a reward to Agent_3 and Agent_4. Agent_3 forwards the reward to Agent_2, as it participated in the composition

Let us unfold the scenario as before, including the learning:

Query: Agent_1 injects a query requesting a property C and provides a property of type A.

Self-composition: Agent_2 bonds with the query and provides a B property. This is followed by Agent_3 bonding with Agent_2's LSA and providing a C property. Agent_1 then bonds with the newly added property and obtains the requested property. Meanwhile, Agent_4 bonds with the initial query injected by Agent_1 and immediately provides a property of type C, which is returned to Agent_1 by bonding. As a result, Agent_1 receives two properties of type C, as shown in Fig. 13, resulting from two different compositions.

Reward: The user evaluates randomly one of the returned compositions. We suppose here that, at time 1, it starts by evaluating the composition provided by Agent_4 and, for some reason, Agent_1 rewards Agent_4 positively (e.g., it finds it reliable). Thus, Agent_4 receives a positive feedback and updates its matrix, while Agent_2 and Agent_3 do not receive any feedback (no matrix update). At the next time step (\(t=2\)), Agent_1 has the possibility to evaluate the composition provided by Agent_2 and Agent_3. A negative reward is sent to both of them, and no reward is sent to Agent_4 (no matrix update). At times \(t=3,4\), Agent_4 is chosen and rewarded positively.

Agent: The Q matrix of all involved agents is presented in Table 2. Agent_1 is a query agent, and its Q matrix is not relevant here.

Table 2 Multiple agents learning—Q matrix update

After a few feedback, according to their matrices, Agent_2 and Agent_3 will take the Ignore action when they encounter the \(C\mid A\) query. The Q-values are updated in line with a reward \(R = +10\) for Agent_4 and \(R = -10\) for Agent_2 and Agent_3. When a composition is not selected, no reward is sent, and the matrix values do not change. For instance, we observe that the first two matrix lines of Agent_2 and Agent_3 do not change between time 0 and 1. This is because they did not receive any feedback from Agent_1 at time 1; only Agent_4 received a feedback. The same holds for Agent_4 at time 2 (no reward received).

7.2.1 Conflicting feedback

Several cases may arise when receiving feedback:

  • An agent engages in a given composition that answers queries from different end users. The end users reward the composition in question differently. In this case, the agents participating in the composition receive conflicting feedback for the same composition schema. Their behavior will then oscillate between Ignore and React until one or the other becomes predominant.

  • An agent engages in two different compositions with two different composition schemas. Conflicting feedback in this case actually corresponds to different composition schemas (i.e., different states and lines in the Q matrix). The corresponding agents will learn the behavior corresponding to each schema, as the schemas are discriminated in the matrix.

7.3 Adaptation, dynamicity and learning

This section discusses the learning parameters and the adaptability of the system to dynamic changes.

7.3.1 Alpha

The learning rate \(\alpha \) is set between 0 and 1. Figure 14 presents how values in the Q matrix vary according to \(\alpha \). The x-axis, labeled iteration, represents a user feedback, under the form of a reward, regarding a query. When the learning rate (in the Q function, Sect. 6.2) is close to 1, the system learns faster than when it is set close to 0 (where a higher number of feedback is required). Similarly, with a high learning rate, an agent quickly changes its behavior after a few opposite feedback. Since, in our system, agents are not systematically rewarded, and we are dealing with a dynamic environment, \(\alpha \) has to be relatively small in order to limit the sensitivity of the agents to received feedback, thereby providing resistance to false negative and false positive feedback. Therefore, we decided to fix \(\alpha \) to 0.3.
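
For illustration, assume a single terminal step (so that the discounted future term vanishes), the initial React value of 5 and a positive reward \(R = +10\); this is a simplified reading of the Q function of Sect. 6.2, not a reproduction of Fig. 14. One update then gives:

$$\begin{aligned} Q_{t+1}&= Q_t + \alpha \times (R - Q_t),\\ \alpha = 0.1&: \quad 5 + 0.1 \times (10 - 5) = 5.5,\\ \alpha = 0.5&: \quad 5 + 0.5 \times (10 - 5) = 7.5,\\ \alpha = 0.9&: \quad 5 + 0.9 \times (10 - 5) = 9.5. \end{aligned}$$

With \(\alpha \) close to 1, a single feedback almost commits the agent, whereas with a small \(\alpha \) several consistent feedback are needed, which provides the resistance to spurious feedback we are after.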

Fig. 14

The Q-value according to the learning rate \(\alpha \) when the latter varies between 0.1 and 0.9

7.3.2 Epsilon

As said above, we employ the \(\epsilon \)-greedy reinforcement learning algorithm [18]. This algorithm has a probability \(\epsilon \) of selecting a random action and a probability \(1-\epsilon \) of selecting the action that maximizes the current approximation of Q(s, a). \(\epsilon \)-greedy ensures a permanent exploration, which is necessary to allow adaptation to changing environmental conditions. We have seen in the previous section how an agent acts when \(\epsilon =0.2\). In this section, to analyze the learned policy and for simulation purposes, we chose for \(\epsilon \) the function presented in Fig. 15. We progressively reduce \(\epsilon \) during the execution in order to demonstrate the effect of learning on the system. In practice, we give the system 300 time steps to learn. At this point, \(\epsilon =0.01\), and the system then essentially exploits its learned policy (only a few explorations happen, in 1% of the cases).

Fig. 15

Epsilon variation using a decreasing linear function until the 300th iteration and a constant function afterward

We simulated a scenario where we created:

  • a set of 15 agents providing the following service:

    \(LSA = \{S=[A], P=[B]\}\);

  • a set of 15 agents providing the following service:

    \(LSA = \{S=[B], P=[C]\}\).

We tagged 11 agents from each set with 1, attesting their correctness, and the 4 remaining agents of each set with 0. If an agent tagged with 0 participates in a composition, all agents involved in this composition are negatively rewarded, and vice versa. Tags are only used for simulation purposes. In this example, agents need to learn what to do for a query providing a property A and requesting a C property. At time 300, following Fig. 15, all agents have an \(\epsilon \) value equal to 0.01. This allows us to observe the effect of learning. Figure 16 shows how the system progressively converges to the maximum number of possible compositions, which is 121 in the presented scenario. The remaining 104 compositions are not pertinent; they are rewarded negatively and rapidly discarded by the agents. Since \(\epsilon = 0.01\), agents continue exploring through random actions in 1% of the cases.
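
The schedule of Fig. 15 can be sketched as follows. The starting value of 0.3 is inferred from Sect. 7.3.4 and is therefore an assumption; only the 300-iteration horizon and the final value of 0.01 are stated explicitly in the text.

```python
def epsilon(iteration: int, start: float = 0.3, end: float = 0.01, horizon: int = 300) -> float:
    """Linear decrease until `horizon`, then constant (Fig. 15)."""
    if iteration >= horizon:
        return end
    return start - (start - end) * iteration / horizon

assert abs(epsilon(0) - 0.3) < 1e-9   # assumed starting value
assert epsilon(300) == 0.01           # after 300 iterations the system mostly exploits
assert epsilon(700) == 0.01
```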

Fig. 16

A composition at run-time based on the learning module following the epsilon function presented in Fig. 15

A high \(\epsilon \) value downgrades learning, since the agent explores the system more frequently by making more frequent random choices. A small \(\epsilon \) value lowers the adaptation capability of the system when the environmental conditions change. Therefore, choosing a suitable value of \(\epsilon \) is critical. For the rest of this section, we experiment with the \(\epsilon \) function shown in Fig. 15, while in the previous section and in the humanitarian scenario we decided to set \(\epsilon = 0.2\), so that agents explore the system in 20% of the cases.

Fig. 17

Learning adapts to agents removal at iteration 400

7.3.3 Agents removal

We keep the same agents, tagging and \(\epsilon \) function presented before in Sect. 7.3.2. Here, we tag 3 more agents of each set with 0 at iteration number 400. That means that 8 agents out of 15 are tagged with 1 in each set after iteration 400. The three newly tagged agents have to change their behavior and stop reacting to the query to which they responded positively until then. This simulates the departure of 3 agents in each set, or a change in behavior rendering them no longer pertinent. Figure 17 shows a behavior similar to Fig. 16 until time 400. After time 400, we see how the number of compositions drops from 121 to 64. The sudden descent at time 400 is due to the high number of feedback (around 9) received by each agent per iteration, thus inverting their Q matrix curves.

7.3.4 Adding agents

As before, we keep the same agents, tagging and \(\epsilon \) function presented in Sect. 7.3.2. Until iteration 400, the behavior is similar to Fig. 16. At time 400, we set all agents' tags to 1. This means that 4 more agents in each set (all 15 agents) have to learn to react and participate in all compositions. To simulate the arrival of new agents, we re-initialized the \(\epsilon \) function of these 4 agents in each set, so that at time 400 these agents have an \(\epsilon \) of 0.3. As before, after 300 more time steps (at time 700), \(\epsilon =0.01\) also for these agents, and we can consider that the rest of Fig. 18 shows the effect of learning only. We observe again that after 300 more iterations (around time 700), all agents apply the learned policy, and we actually reach 225 compositions. In these examples, the system converges quickly because all agents receive feedback; in practice, it takes longer for an agent to change its behavior, as we return only one result per query.

Fig. 18

Learning adapts to agents arriving at iteration 400

These analyses helped us determine the actual learning capabilities and adaptation of the agents when faced with different cases, such as the arrival or departure of agents. For the remaining part of this paper, we decided to set \(\epsilon = 0.2\), so that agents explore the system in 20% of the cases. This is a trade-off leaving space for learning while allowing adaptation to changes. Finer studies are needed to investigate the most appropriate value for \(\epsilon \), or ways of adapting \(\epsilon \) to the circumstances.

7.4 Reliable service

Preliminary results show how our system converges to a pertinent composition after a few user feedback and adapts itself to environmental changes. As we have seen from the previous examples, various results can arise from the collective interactions of the agents. Some of these results may be valid answers, some not, and some may end up in partial compositions. In terms of quality of service, some results may be better than others. Indeed, services built on-demand are relevant in specific application contexts, particularly in open smart environments and for applications deployed over several nodes. Their performance is variable, and some of them can provide similar services or end user applications. Services select themselves, change and combine with each other to guarantee and maintain appropriate non-functional properties expressed as quality of service (QoS) [1]. In this section, we introduce a score function. It is used to choose among all relevant compositions and helps identify the most reliable ones, i.e., the ones with the best QoS.

We defined four quality measurements:

  • Time : execution time in seconds

  • Cost : service execution cost

  • Availability : service availability

  • Reliability : service reliability and accuracy

We define a score function \(f:{\mathbb {R}}^4 \rightarrow {\mathbb {R}} \) by

$$\begin{aligned} \begin{aligned}&f(time,cost,availability,reliability) \\&\quad =\dfrac{\omega _1}{time} + \dfrac{\omega _2}{cost} +\omega _3 \times availability + \omega _4 \times reliability, \end{aligned} \end{aligned}$$

where \(\omega _1, \omega _2\), \(\omega _3\) and \(\omega _4\) are weights that can be modified by the end user or the agent requiring the service, and where \(\omega _1 +\omega _2 +\omega _3+\omega _4=1\). In some cases, the execution time is more important, while in others the cost is the discriminating factor. Thus, the four quality measurements above allow capturing various cases. Time favors the selection of shorter compositions (with fewer agents involved), or those where internal computations are quicker. Cost favors more local compositions, avoiding unnecessary spreading. Availability reflects a service's accessibility and readiness to answer requests, whereas reliability indicates how correct and accurate the service is. Instead of randomly choosing among two or more similar services, agents select the service providing the highest QoS. Figure 19 shows how the QoS helps deciding among various compositions, thus increasing the system's reliability and performance. In our system, learning allows providing pertinent compositions, and QoS selection retains the most reliable ones from the end user perspective. Since the reward takes the QoS into account too, the learning scheme allows retaining the compositions that are both the most pertinent and reliable.
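
A sketch of the score function f and of the selection it drives (Fig. 19) follows; the weights and QoS values are illustrative, not measurements from our system.

```python
def score(time, cost, availability, reliability,
          w1=0.25, w2=0.25, w3=0.25, w4=0.25):
    """Score function f; the weights must sum to 1."""
    assert abs(w1 + w2 + w3 + w4 - 1.0) < 1e-9
    return w1 / time + w2 / cost + w3 * availability + w4 * reliability

# Two agents offering the same B property (Fig. 19); QoS values are made up.
candidates = {
    "Agent_1": score(time=1.0, cost=2.0, availability=0.99, reliability=0.95),
    "Agent_2": score(time=3.0, cost=2.0, availability=0.80, reliability=0.90),
}
best = max(candidates, key=candidates.get)
print(best)   # Agent_1: faster and more available/reliable, so its property is consumed
```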

Fig. 19 Service composition using the learning module and the QoS: Agent_3 requests a C property and provides an A property. Agent_1 and Agent_2 consume an A property and provide a B property. Agent_4 consumes the B property provided by the agent having the highest QoS (in this case, Agent_1) and provides a C property

8 Application

We developed a humanitarian scenario, where we consider that international and non-governmental organizations temporarily deploy their services along a geographical crisis area. They arrive and set up their services at different times and locations. Similarly, different needs for applications arise at different locations. They all may change over time. To show the utility of learning for both optimization and reliability, we present a dynamic application arising from collective interactions among several devices. We consider the following actors and the services they provide: (1) a doctor (our end user) providing health care, but also in need of specific surgical equipment, drugs or blood bags. In this scenario, the doctor makes the corresponding requests; (2) the ICRC (Red Cross) sets up tents at two different locations. Each tent provides surgical equipment, drugs or blood bags; (3) various goods transport services, by drone or by car, also become available in the area; (4) an emergency transport, via helicopter, to evacuate people in need, is also available.

Let us consider that the doctor requires, for instance, blood bags to be brought to his location as soon as possible. He does not know where blood bags of the right blood type are available and is not aware of the currently available means of transport. The doctor injects a query asking for blood bags of the requested type to be transported to his position.

Fig. 20 Humanitarian scenario involving heterogeneous connected devices

The ICRC tents may answer the doctor's needs by providing information about available blood bags of the requested type. Transport by drone or by car can help carry blood bags from the tents to the destination where the blood is needed. The helicopter, also being a means of transport, is sensitive to the request as well. Here, we show how our system learns to provide the most pertinent and reliable service (with the highest QoS). We consider a distributed network composed of six nodes providing the following services:

  • Two ICRC tents

  • Two means of transport (drone or car)

  • A helicopter

  • A mobile phone allowing the doctor to inject queries.

Several compositions arise, assuming all the nodes hosting these services are connected, as shown in Fig. 20. The available services (transport, blood bags, etc.) self-compose thanks to agent coordination. As a result, one composed service is provided to the doctor, taking the QoS into account. The doctor then evaluates the actual result in order to improve reliability for further use. Figures 21 and 22 show the self-composition resulting from the request for blood bags.

We now unroll the scenario:

Fig. 21 Blood bags transport composition diagram when querying a means of transport, requesting a Transport property and providing a Destination position and a Blood type

Step 1: A doctor requests to transport a blood bag of type A+ to his geographic position. We can imagine the doctor’s oral query transformed into the suitable LSA format, with a Natural Language Understanding (NLU) system [23]. Agent_1, working on behalf of the doctor, transforms this oral query into a Service request with two provided inputs. LSA: \(\{ S=[Transport], P =[<Destination: x>,<Blood:A+>] \}\) is injected at Node_1.
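
As an illustration only, the injected LSA could be represented in memory as follows; the field names mirror the \(\{S=[\ldots ], P=[\ldots ]\}\) notation above but are otherwise our own assumptions.

    from dataclasses import dataclass

    @dataclass
    class LSA:
        service: list       # requested properties, e.g. ["Transport"]
        properties: dict    # provided properties, e.g. {"Destination": "x"}
        node: str = ""      # node where the LSA is injected

    query = LSA(service=["Transport"],
                properties={"Destination": "x", "Blood": "A+"},
                node="Node_1")
    print(query)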

Step 2: The LSA query spreads to all neighboring nodes.

Step 3: The two agents working on behalf of the two ICRC tents, at Node_2 and Node_3, are sensitive to the “Blood” property. Upon receiving the LSA query, they execute the corresponding service and update their LSAs by adding two new properties: “BloodBags” and their own respective “Position.”

Agent_2, located in Node_2, announces having two blood bags of type A+, while Agent_3, located in Node_3, announces having four such blood bags.

On the other hand, Agent_6 is sensitive to the “Destination” property. It reacts by injecting a new “Transport” property. The various Q matrices are presented below:

            State                                          Ignore   React
Agent_2     \(Transport\mid Blood,Destination\)            0        5
Agent_3     \(Transport\mid Blood,Destination\)            0        5
Agent_6     \(Transport\mid Blood,Destination\)            0        5
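
As a rough illustration of how these Q-values drive the reaction, the sketch below encodes the state as “requested property | provided inputs” and lets an agent react when the Q-value of reacting dominates (subject to the \(\epsilon \)-greedy exploration of Sect. 7.3.2); the encoding and names are assumptions, not the platform's implementation.

    def state_key(requested, provided):
        """Encode a query as 'requested|input1,input2,...'."""
        return f"{requested}|{','.join(sorted(provided))}"

    def handle_query(agent_q, requested, provided):
        state = state_key(requested, provided)
        ignore = agent_q.get((state, "ignore"), 0.0)
        react = agent_q.get((state, "react"), 0.0)
        return "react" if react > ignore else "ignore"

    agent_2_q = {("Transport|Blood,Destination", "ignore"): 0,
                 ("Transport|Blood,Destination", "react"): 5}
    print(handle_query(agent_2_q, "Transport", ["Blood", "Destination"]))  # react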

Step 4: A copy of these LSAs is spread to all neighboring nodes in the network.

Fig. 22 Simplified blood bags transport composition diagram using the learning-based coordination model and QoS

Step 5: The two means of transport bond with all LSAs providing the “Position” and “Destination” properties. Agent_2 and Agent_3 provide a similar service; thus, if an agent can bond with the same property from more than one provider, it chooses the one offered by the agent with the highest QoS. In this example, Agent_2 has the higher QoS and is therefore favored over the similar service provided by Agent_3. Two means of transport are available in the proposed network: a car and a drone, identified, respectively, by Agent_4 in Node_4 and Agent_5 in Node_5. They both inject a new “Transport” property describing the transport of the two blood bags provided by Node_2 from their positions to the destination (the doctor's position), by car or by drone.
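
The choice between the two similar providers can be sketched as below; the QoS values attributed to Agent_2 and Agent_3 are invented for the illustration, and only the rule of keeping the provider with the highest score comes from the text.

    def qos_score(time, cost, availability, reliability, w=(0.25, 0.25, 0.25, 0.25)):
        # Same form as the score function f of Sect. 7.4, with illustrative equal weights.
        return w[0] / time + w[1] / cost + w[2] * availability + w[3] * reliability

    # Hypothetical QoS measurements for the two blood-bag providers.
    candidates = {
        "Agent_2": dict(time=1.5, cost=2.0, availability=0.95, reliability=0.9),
        "Agent_3": dict(time=3.0, cost=2.5, availability=0.90, reliability=0.9),
    }
    best = max(candidates, key=lambda name: qos_score(**candidates[name]))
    print(best)  # Agent_2 is favored, as in the scenario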

Meanwhile, Agent_1 bonds with the first result provided by Agent_6.

The Q matrices of Agent_4 and Agent_5 are presented below:

            State                                          Ignore   React
Agent_4     \(Transport\mid Blood,Destination,Position\)   0        5
Agent_5     \(Transport\mid Blood,Destination,Position\)   0        5

Step 6: A copy of these LSAs spreads to all nodes in the network.

Step 7: Finally, Agent_1's LSA, waiting for a “Transport” property, bonds with the two other compositions (Step 5) and retrieves the information about the means of transport and blood bags.

Step 8: In our case, Agent_1 selects the emergency transport service by helicopter, as it has the highest QoS, and returns this result to the user for evaluation.

Step 9: The doctor attributes a negative reward, since this is not a pertinent service for him: although it is a transport service, it does not transport blood bags. A backward reward is sent to Agent_6.

The latter’s Q matrix is presented below:

            State                                          Ignore   React
Agent_6     \(Transport\mid Blood,Destination\)            3        0.5
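
The drop of the “React” value after this negative feedback can be reproduced with a standard tabular Q-learning update, sketched below; the learning rate, discount factor and reward magnitude are illustrative assumptions (the paper sets \(\alpha \) in Sect. 7.3.1 and defines its own reward scheme), so the resulting numbers differ slightly from the table above.

    def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
        """One tabular Q-learning step over the two actions used by the agents."""
        best_next = max(q.get((next_state, a), 0.0) for a in ("ignore", "react"))
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

    q = {("Transport|Blood,Destination", "ignore"): 0,
         ("Transport|Blood,Destination", "react"): 5}
    # Backward negative reward for having reacted to a non-pertinent request.
    q_update(q, "Transport|Blood,Destination", "react", reward=-5, next_state="done")
    print(q)  # the value of "react" decreases, so Agent_6 tends to ignore next time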

Step 10: A second query is injected, hoping to retrieve a better answer.

Step 11: This time, the composition arising from Agent_2 and Agent_4 is returned to the user. The doctor evaluates the new result built through self-composition. A positive reward is given and sent to all involved agents.

After a few queries, Agent_3, Agent_5 and Agent_6 learn to ignore the query providing the “BloodBags” and “Destination” properties and asking for “Transport.” In addition, Agent_2 and Agent_4 learn to react, as they provide the most pertinent and reliable result.

            State                                          Ignore   React
Agent_2     \(Transport\mid Blood,Destination\)            -3       6.5
Agent_4     \(Transport\mid Blood,Destination,Position\)   -3       6.5

Figure 22 shows a simplified view of the service compositions obtained in this scenario. Through the self-composition process, the system could produce 5 possibilities: transporting blood bags by car or by drone from the two ICRC tents to the doctor (4 possibilities) and a transport by helicopter (not involving blood bags). Due to the QoS information, in this example, Agent_2's QoS > Agent_3's QoS and Agent_4's QoS > Agent_5's QoS. During self-composition, Agent_2 and Agent_4 are favored, providing de facto a single meaningful composition: the transport of two blood bags of type A+ from Agent_2's tent by the favored means of transport. The helicopter transport is, here, a non-pertinent solution. In a larger scenario with identical QoS, we would end up with more pertinent self-compositions, among which user feedback would discriminate. For test purposes, we designed and deployed five smart nodes equipped with our learning-based coordination model. We attached all the services presented above to these nodes and implemented the scenario using:

  • Two ICRC tents, each consisting of a Raspberry Pi 3 hosting a service providing the available blood bags.

  • A SunFounder PiCar-V, based on a Raspberry Pi 3, providing a transportation service.

  • A Parrot Bebop 2 drone providing a transport service.

  • A helicopter, based on a Raspberry Pi 3, providing an emergency transportation service.

  • An Android smartphone to inject the service request and evaluate the results.

9 Results and analysis: summary

9.1 System operation

Self-composition: Sect. 5 shows how the system actually explores and computes all compositions, possibly complex ones, based on services' input and output compatibility using the Bonding eco-law. Limits: The results of all possible compositions are not necessarily pertinent. Therefore, we introduced learning.

Self-composition with learning: Sects. 6 and 7 show how the system progressively learns the appropriate compositions based on the users' requests, providing the pertinent composition (the actual service expected by the user). In addition, the system progressively avoids unnecessary bonding operations and compositions as agents learn the optimal policy. As long as users provide feedback on the returned results, agents can adapt themselves to changes and adjust their behavior. Moreover, the system can automatically detect at run-time the departure of agents or the addition of new ones, thanks to the indirect retrieval and injection of properties in the shared tuple space. The adaptation speed depends on \(\alpha \) and \(\epsilon \), presented, respectively, in Sects. 7.3.1 and 7.3.2. Limits: Although the proposed platform provides on-the-fly service composition at run-time, it does not provide a personalized learning system. Users could request different services using the same query, and agents do not discriminate between users' queries.

Self-composition with learning and QoS: Sects. 7.4 and 8 show a refinement of the self-composition strategy including QoS information, which serves to select the service with the highest QoS. This helps refine service selection based on the weights (\(\omega _1, \omega _2, \omega _3\) and \(\omega _4\)) that an end user can specify for each query. Results show that the system selects the pertinent and most reliable composition (in terms of QoS). In addition, QoS supports adaptation to changes, since it is recalculated at each query, which fosters competition between agents. Limits: The combination of all locally highest QoS values does not necessarily provide the highest global QoS. When several service combinations arise, only one is submitted to the user; if it is not pertinent, a new query must take place.

9.2 System performance

Initial exploration and convergence of learning: At the beginning, our system needs a few queries and feedback rounds to adjust each agent's behavior. It initially produces random solutions caused by the exploration process and may thus provide non-pertinent services to the user. The system needs several rounds of user feedback to converge toward pertinent services, as presented in Sect. 7.

Adaptation to changes: When a change occurs, the agents' adaptation is not instantaneous and requires more feedback than the initial exploration.

Scalability: The service composition slows down when the network gets bigger and the number of services increases. Ongoing experiments will serve to quantify scalability precisely.

10 Conclusion

In our model, everything is assimilated to a service: a sensor feeding data is a service, an actuator opening or closing blinds is a service. Software agents act as wrappers, actually providing the service on behalf of these entities. They also serve to provide, at run-time, reliable self-composed services using reinforcement learning. This helps refine the returned results and ensure the expected quality of service.

Our approach to self-composition assumes no coupling between the various participating agents; requested services may or may not be available. We showed various cases of self-composition: multiple requests by different end users answered by the same agents and properly provided to the respective requesting agents; multiple service compositions arising from a single initial request; multiple inputs and outputs for the various requests; chains of service composition as well as more complex compositions. We also discussed our reinforcement learning approach and presented preliminary generic results obtained through the implementation of both the self-composition and the learning in our coordination platform. We also showed how reliability and QoS are taken into account.

Future work will focus on deploying and testing our system on large-scale examples, as well as on conducting thorough studies of the learning parameters, in particular \(\epsilon \) and the possibility of adapting it at run-time. Future work will also include a detailed comparison of a given scenario with and without RL, and a comparison with other learning approaches, mainly deep Q-learning.

Coordination models have an impact on forthcoming Internet of Things, mobile device and smart city scenarios. On-demand services with learning capabilities support a new generation of services, providing adaptive solutions for geographically distributed services [11].