1 Introduction

The development of multi-channel protocols for wireless sensor networks (WSNs) not only improves network performance in terms of throughput and delivery rate, as illustrated in Fig. 1, but also raises the challenge of designing distributed resource-allocation schemes in an energy-efficient way, since energy consumption is the major constraint facing WSNs.

Fig. 1 Single-channel vs. multi-channel communication in a WSN

In recent years, WSNs have been widely used in the Internet of Things (IoT), where they act as the interface between the IoT and the physical world. Optimizing network lifetime and performance therefore becomes even more important, since sensor nodes are limited by low-power sources, usually small batteries, and single-antenna radios. Several multi-channel protocols have been proposed to address this challenge and improve WSN performance. In such protocols, however, it is difficult to fully exploit routing information to achieve a collision-free channel assignment and transmission schedule, because nodes must dynamically share information and take part in the channel-assignment and scheduling process, while frequent negotiations incur large overhead. The problem is NP-hard [1, 2], and it becomes even more challenging in large-scale WSNs, where the communication overhead should be spread across all sensors through distributed protocols.

Consequently, intelligent protocols have been proposed to address this challenge. Some of them, such as game-theoretic protocols [5], are not well suited to WSNs because their complexity leads to high energy consumption and wasted time. In contrast, schedule-based reinforcement learning (RL) protocols [6, 7, 1, 8] appear more suitable for WSNs thanks to their fully distributed nature and to implementations that can exploit routing information with little communication overhead. Such protocols gradually improve WSN performance based on the principle of action and feedback analysis. Nevertheless, they still require a considerable number of iterations to reach a good solution, since they rely on single-agent learning without cooperation between agents; this in turn costs the network additional energy through supplementary communication overhead, as well as wasted time. Furthermore, they do not address energy balancing, even though it is an important means of improving network lifetime, since it extends the lifetime of each node and of the nodes it communicates with as much as possible.

In this paper, our main contributions are as follows. First, we propose a cooperative multi-agent reinforcement learning approach for channel assignment that improves the learning process by reducing the number of iterations in distributed schedule-based WSNs. Second, to ensure energy efficiency, we strengthen the proposed approach with both self-scheduling and load-balancing methods; the RL agents thus self-schedule the cooperation flow to assign optimal channels based on the load balance between communicating nodes. Third, we propose two algorithms to investigate the performance of our approach in static and dynamic modes: the Static Self-schedule based Cooperative multi-agent Reinforcement Learning for Channel Assignment (SSCRL CA) and the Dynamic Self-schedule based Cooperative multi-agent Reinforcement Learning for Channel Assignment (DSCRL CA).

The objective is to improve network performance and lifetime in schedule-based multi-channel WSNs by accelerating the RL process and reducing energy consumption, through the reduction of communication overhead and collisions on one side, and the balancing of energy consumption on the other.

In the remainder of this paper, we discuss in Sect. 2 the related works on schedule-based multi-channel WSNs. In Sect. 3, we present our proposed approach and protocols. The evaluation results and their discussion are presented in Sect. 4. Finally, the conclusion and future work are given in Sect. 5.

2 Related Works

Several distributed schedule-based multi-channel protocols have been proposed for WSNs to accommodate parallel transmissions using multiple channels. These protocols can be classified into two categories: intelligent and non-intelligent protocols. The non-intelligent protocols assign the channels in either static or dynamic mode. In the static mode, the nodes keep the same selected channels in the time slots of the repeating frame periods, while in the dynamic one, the channels can be changed in each frame period.

The authors of [9] have proposed a dynamic multi-channel MAC (Y-MAC) protocol. In this TDMA-based multi-channel MAC protocol, time is divided into frames; each frame is divided into a broadcast sub-period of three slots and a unicast sub-period for the remaining slots. In the broadcast sub-period, a common channel is used to exchange control messages in order to coordinate the communication that will take place in the second sub-period. In [10], a static schedule-based multi-channel MAC protocol is proposed. In this protocol, the neighboring nodes are split into different groups and the time slots of the frame period are divided between these groups. The receiving nodes then select, in their group's sub-period, the channel-slot pairs that are not chosen by their two-hop neighbors. In [11], a static Multi-Channel MAC (MC-MAC) protocol is proposed. In MC-MAC, the neighboring nodes start by exchanging their frame vectors, which indicate the channel used in each time slot, and then select the time slots to be used on a particular channel in a collision-free manner. In [12], the authors have proposed a dynamic Regret Matching based Channel Assignment algorithm (RMCA). In RMCA, each sensor node updates its channel choice according to the historical record of the channels' performance, based on transmission success and delay, in order to reduce interference. The authors of [13] have proposed a static joint time slot and frequency channel allocation algorithm based on the loopy belief propagation (BP) approach using a factor graph. The algorithm imprints space, time, frequency and radio hardware constraints into a loopy factor graph and performs iterative message-passing loopy belief propagation with randomized initial priors.

Nevertheless, the non-intelligent protocols generally require overall knowledge of the external environment to make their decisions, which results in a large amount of message exchange and thus a high energy cost.

The intelligent protocols are distinguished from the non-intelligent ones by using only the feedback acquired from the external environment in the channel-selection process to approximate the best selection. These protocols can be classified into two categories: game-based and learning-based protocols. The game-based methods apply game-theoretic rules to study strategic decisions for optimal channel selection.

In [5], the power levels at which a node should transmit are considered, and the existence of a Nash Equilibrium (NE) is investigated for two cases: fixed channel conditions and varying channel conditions. The work in [3] focuses on optimizing network performance by controlling two factors, node transmission power and communication interference; after the game period, a node with low residual energy chooses a lower transmission power. The authors of [14] investigated channel allocation only for receiving nodes, by allocating different channels to adjacent nodes; the nodes then use the receiving-channel information to select different channels, giving priority to low-energy nodes. The authors of [4] have proposed a centralized scheme for spectrum allocation modeled as a multi-objective optimization problem that maximizes spectrum utilization, proportionally fair allocation and transmission priority among nodes while avoiding unnecessary spectrum handover; a modified cooperative game is then used to handle the multi-objective optimization problem as a single-objective function.

Even though the game-based methods may lead to better, interference-free solutions, the large number of iterations required in the game phase to reach the NE results in high energy consumption and delay, making them poorly suited to large-scale WSNs.

Learning-based methods emphasize the application of the Reinforcement Learning (RL) approach, whose greedy mode is well suited to the nature of WSNs [15]. The RL approach has already been used to address several single-channel problems in WSNs, such as routing and task scheduling [16, 15]. In recent years, it has also been used to deal with the distributed schedule-based channel-assignment problem using a single RL agent.

The authors of [8] have proposed a Normal Equation based Channel quality prediction (NEC) algorithm that first performs channel rank measurement (CRM) based on the received signal strength indicator (RSSI) and the average link quality indicator (LQI) for each channel. The set of accepted channels is then used to train a machine-learning operation between the neighboring nodes. In [7], a trade-off between two methods is proposed: an online non-cooperative game and an offline schedule-based machine learning for channel assignment. The first method is used if the network can satisfy the energy requirements; otherwise, the second one is used dynamically. In [6], the authors propose the Coverage and Connectivity Maintenance algorithm based on RL (CCM-RL), which lets each node take the best action in order to maximize the coverage rate and maintain network connectivity. In [1], a schedule-based multi-channel communication protocol built on the RL MMAC (Reinforcement Learning Multi-channel MAC) algorithm is proposed; it extends the work proposed in [2]. In RL MMAC, the nodes learn to schedule their transmissions on their parents' channels in a distributed manner using the "win-stay, lose-shift" strategy. After the learning process, the best action in each slot is chosen if its probability of success exceeds the sleeping threshold.

The cooperative multi-agent RL approach builds a cooperative machine-learning system between several agents, which allows multiple agents to learn together, exploiting one another's strengths to compensate for individual weaknesses and to accelerate learning. Nevertheless, using this approach raises new issues that can be summarized in three questions: why, how and when can multiple agents learn together?

To answer these questions, several solutions have been proposed in the Artificial Intelligence multi-agent RL sub-field of Transfer Learning (TL), depending on the setting for which they were designed. The cooperation between RL agents can take several forms, such as coordination, competition and advising [17]. Nevertheless, these solutions cannot be applied directly to WSNs without considering their resource constraints. For this reason, other solutions have been proposed to deal with TL between RL agents in decentralized WSNs with a focus on energy efficiency, such as [18,19,20] for task scheduling. However, these solutions are based on a single channel.

In multi-channel communication, cooperative multi-agent RL has mainly focused on Cognitive Radio (CR) networks, where spectrum sensing is used to avoid primary users, as in [21, 22]. However, CR networks differ from WSNs in that CR devices have more power and capabilities than WSN nodes.

In [23], the authors propose the Collaborative Multi-agent Anti-jamming Algorithm (CMAA), which avoids unsecured channels on one hand and competes for the best channel to use in each time slot on the other, in wireless networks.

However, these solutions do not take the scheduling issue into account in the transfer-learning process, which can result in a high cost of collisions and communication overhead, leading to more energy consumption and negative transfers. Moreover, most of these protocols do not address the energy-balancing technique, which is an important means of improving network lifetime.

To address these limitations, we propose the SCRL CA (Self-schedule based Cooperative multi-agent Reinforcement Learning for Channel Assignment) approach, which performs channel selection based on a self-schedule scheme in a cooperative learning manner to improve the performance and lifetime of schedule-based multi-channel WSNs. We also investigate the performance of our approach through two protocols, one for the static mode (SSCRL CA) and the other for the dynamic mode (DSCRL CA).

3 Self-schedule Based Cooperative Multi-agent Reinforcement Learning for Channel Assignment (SCRL CA)

3.1 Self-schedule Approach

In WSNs, routing metrics are generally used in the receiver-selection process. The nodes have to collect these metrics periodically from their neighboring nodes and then use them to select the optimal receiver node, which results in high communication overhead and thus wasted energy. In contrast, the authors of [24] used these metrics in a balanced, self-scheduling manner for data routing in single-channel WSNs, to reduce the communication overhead caused by the information exchanged between neighboring nodes. For this purpose, each node that has a data message to send starts by sending an RTS (Request To Send) control message to its neighboring nodes; the receiver nodes must then wait for a time given by (1) before responding with a CTS (Clear To Send) message. In this way, the node that waits the shortest time responds first, and the other waiting neighbors learn from this response that the request has been satisfied, so they do not send their own CTS control messages.

$$\begin{aligned} t = \frac{\left( \frac{1}{d}\right) \times \gamma }{E\times r} \end{aligned}$$
(1)

where d is the distance of the node from the sink, E is its residual energy, \(\gamma\) is the number of its neighboring nodes and r is the link reliability.
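
To make the timing rule concrete, the following Java sketch computes the waiting time of (1); the class, method and parameter names are illustrative assumptions and are not taken from [24].

/**
 * Minimal sketch of the receiver waiting time of Eq. (1):
 * t = ((1/d) * gamma) / (E * r). Names and units are illustrative assumptions.
 */
public final class Rts24WaitTime {
    static double waitTime(double distanceToSink, int neighborCount,
                           double residualEnergy, double linkReliability) {
        // The receiver with the smallest t answers the RTS first; the others
        // overhear its CTS and suppress their own responses.
        return ((1.0 / distanceToSink) * neighborCount)
                / (residualEnergy * linkReliability);
    }

    public static void main(String[] args) {
        // Hypothetical values: 120 m from the sink, 5 neighbors, 0.8 J left, 90% reliable link.
        System.out.println(waitTime(120.0, 5, 0.8, 0.9));
    }
}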

Although the results demonstrate the efficiency of this scheme in terms of communication-overhead reduction, it has some limitations, which can be summarized as follows. First, it does not guarantee collision freedom for the sender nodes, since it focuses only on the receivers and does not take into account that senders can transmit their RTSs simultaneously. Second, it uses an unbounded waiting time, which can be very long even for the smallest value, which in turn increases wasted time and energy consumption.

3.2 System Model

We define a WSN as a set of N nodes, denoted as N=\(\{1,...,N\}\), randomly dispersed throughout an area. Each node uses a predefined set of orthogonal channels K=\(\{ f_{1},...,f_{k}\}\) within a communication range R, for example the non-overlapping channels proposed by the IEEE 802.15.4 standard (16 channels in the 2.4 GHz band and 10 channels in the 915 MHz band) [25]. All the nodes can communicate with the sink node in a multi-hop fashion through communication with their M neighbors located within their communication range.

After the initialization phase, all the nodes are synchronized with the global time clock of the sink. The time is divided into consecutive frame periods, which are in turn divided into a fixed number of slots. Two types of slots are considered: Beacon slots for exchanging control information (beacon messages, new request messages, etc.) on the same dedicated channel, and Data slots for data transmission. Furthermore, each node keeps the lowest number D of hops from the sink node as its depth, a list of its parent nodes selected from its neighboring nodes, and the maximum parent-list size among its neighboring nodes. The parent nodes are the neighboring nodes with a smaller D than that of the node.
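
As an illustration, the per-node state described above can be sketched in Java as follows; the class layout and field names are assumptions made for the example.

import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the per-node state of Sect. 3.2; class and field names are assumptions. */
public class SensorNode {
    final int id;                                     // node identifier in N = {1, ..., N}
    final int[] channels;                             // the K orthogonal channels, e.g. the 16 IEEE 802.15.4 channels at 2.4 GHz
    int depth;                                        // D: lowest number of hops to the sink
    final List<Integer> parents = new ArrayList<>();  // neighbors with a smaller depth D
    int maxNeighborParentDegree;                      // MaxPD: largest parent-list size among the neighbors
    double residualEnergy;                            // E_R, starts at the initial energy E_Init

    SensorNode(int id, int[] channels, double initialEnergy) {
        this.id = id;
        this.channels = channels;
        this.residualEnergy = initialEnergy;
    }
}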

3.3 Problem Formulation

To perform both channel and time (channel-time) scheduling for each time slot, a cooperative multi-agent reinforcement learning approach is applied. Hence, the channel-time scheduling problem can be formulated as a Markov Game [17], which can be expressed mathematically as MG=\(\{n,S,A,T,R_{1...N},\gamma \}\), where:

  • n is the number of agents,

  • S is the set of states of all agents,

  • A is the joint action space composed of local actions for all the agents,

  • T:S\(\times\)A\(\times\)S\(\longrightarrow\)[0,1], is the state transition function,

  • \(R_{i}\):S\(\times\)A\(\times\)S\(\longrightarrow\) \({\mathbf {R}}\) is the reward function of agent i.

  • \(\gamma\) \(\in\) [0,1), is the discount factor, which represents the relative importance of future and present rewards.

Hence, we define the number of agents n as the number of nodes N, so each node is considered as an agent. The set of actions for each node x is defined as \(A_x=\{Sf_{1},...,Sf_{k},Rf_{1},...,Rf_{k}\}\), where \(Sf_{i}\) and \(Rf_{i}\) mean send on channel \(f_{i}\) and receive on channel \(f_{i}\), respectively. Therefore, \(A_1\)=\(A_2\)=...=\(A_N\). In order to avoid complex calculations that are not desirable for WSNs, we use the stateless variant of the Q-learning method, which has demonstrated its efficiency for distributed learning problems [26]. The state transition function is chosen based on a trade-off between the "win-stay, lose-shift" strategy used in RL MMAC [1] and the stateless Q-learning method. The "win-stay, lose-shift" strategy plays an important role in accelerating learning: if the agent fails to perform an action after some successes, it does not lose time in a gradual return. Therefore, T is defined by (2):

$$\begin{aligned} Q_{x,t+1}(a_{t}) =(1-\alpha ) Q_{x,t}(a_t) + \alpha r_{x}(a_t) \end{aligned}$$
(2)

where \(Q_{x,t+1}(a_t)\) is the Q value updated at time slot t + 1 (the slot in the current frame) by agent x, after executing action \(a_t\) at time slot t (the same slot in the previous frame). \(r_x(a_t)\) is the immediate reward, calculated by (3), obtained after executing action \(a_t\) at time slot t by agent x. The reward takes the value +1 if either the receiver node receives a data message or the sender node receives an Acknowledgement (Ack) message on the chosen channel.

$$\begin{aligned} r_{x,t}(a_{t}) = {\left\{ \begin{array}{ll} +1 \quad \text { if it is a successful communication} \\ -1 \qquad \text { otherwise} \end{array}\right. } \end{aligned}$$
(3)

\(\alpha\) is the learning rate parameter, which can be set to a value in [0,1]. In order to formalize the "win-stay, lose-shift" strategy, which is mentioned in [28] without a formulation, we use the "win or learn fast" variable-learning-rate method proposed in [27], where a small value of \(\alpha\) is used for successful actions and a higher value for unsuccessful ones. Therefore, \(\alpha\) is given by (4):

$$\begin{aligned} \alpha = {\left\{ \begin{array}{ll} 0.01 \quad \text { if r > 0} \\ 0.1 \qquad \text { otherwise} \end{array}\right. } \end{aligned}$$
(4)
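
The following Java sketch illustrates the stateless Q-update of (2) together with the reward of (3) and the variable learning rate of (4); the action encoding (strings such as "Sf3" or "Rf1") and the class structure are assumptions made for illustration.

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the stateless Q-update Q(a) <- (1 - alpha) Q(a) + alpha * r of Eq. (2),
 * with the reward of Eq. (3) and the "win or learn fast" learning rate of Eq. (4).
 */
public class StatelessQAgent {
    private final Map<String, Double> q = new HashMap<>();  // one entry per action, e.g. "Sf3" or "Rf1"

    /** Eq. (3): +1 for a successful data/Ack exchange on the chosen channel, -1 otherwise. */
    static double reward(boolean success) {
        return success ? +1.0 : -1.0;
    }

    /** Eq. (2) with the variable learning rate of Eq. (4). */
    void update(String action, boolean success) {
        double r = reward(success);
        double alpha = (r > 0) ? 0.01 : 0.1;      // learn slowly when winning, fast when losing
        double old = q.getOrDefault(action, 0.0);
        q.put(action, (1.0 - alpha) * old + alpha * r);
    }

    double qValue(String action) {
        return q.getOrDefault(action, 0.0);
    }
}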

The best joint action to be selected in the next time slot is given by (5):

$$\begin{aligned} a^{*} \in \mathop {\arg \max }\limits _{a \in A_{av}} \left[ \sum \limits _{x=1}^{NG} Q_{x,t+1}(a)\right] \end{aligned}$$
(5)

where \(a^*\) is the best joint action to be selected in the next time slot, obtained by taking the maximum of the sums of Q values over the available action set \(A_{av}\) reported by the Neighboring nodes Group (NG), since the cooperation between the agents is reduced to cooperation between neighboring nodes. To perform (5), each node must collect messages from its neighboring nodes according to the cooperation model explained in the next sub-section.

The discount factor \(\gamma\) is not used since we apply the stateless method.
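
A minimal Java sketch of the joint action choice of (5) is given below; the way the neighbors' Q values are gathered (a map from each available action to the list of Q values reported by the neighboring group NG) is an assumption made for illustration.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of Eq. (5): among the actions still available, pick the one maximising
 * the sum of the Q values reported by the neighboring group (NG).
 */
public class JointActionSelector {
    static String bestAction(Set<String> availableActions,
                             Map<String, List<Double>> neighborQValues) {
        String best = null;
        double bestSum = Double.NEGATIVE_INFINITY;
        for (String a : availableActions) {
            double sum = 0.0;
            for (double qx : neighborQValues.getOrDefault(a, Collections.emptyList())) {
                sum += qx;   // Q_{x,t+1}(a) gathered from each neighbor x in NG
            }
            if (sum > bestSum) {
                bestSum = sum;
                best = a;
            }
        }
        return best;         // a* of Eq. (5)
    }
}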

3.4 Cooperation Model and Algorithms

In order to perform the cooperation process in an energy-efficient way, we use prior coordination based on the "social conventions" strategy used in [28] to coordinate Unmanned Aerial Vehicles (UAVs) for field coverage. In that work, the UAVs coordinate the action to be selected in advance: each UAV must not choose an action already chosen by the others. To perform this coordination, a specific ranking order is assigned to each UAV; the one with the highest order selects its action first and lets the others know its choice, and the remaining UAVs then match their actions with respect to the previously selected ones.

To adapt this method to WSNs in an energy-efficient way, we use a self-scheduling mechanism between neighboring nodes. Before performing actions on the selected channel, the neighboring nodes select their actions sequentially, informing one another. To do so, the neighboring nodes must use the same dedicated channel. Therefore, each data time slot in the frame period is divided into two periods: a broadcast period \(T_{B}\), in which all nodes switch back to the dedicated channel, and a unicast period \(T_{U}\), in which the communicating nodes use the selected channel for sending data messages. Note that the dedicated channel can also be used in the \(T_{U}\) period. The \(T_{B}\) period is divided into two equal and consecutive sub-periods, \(T_{RTS}\) and \(T_{CTS}\), as shown in Fig. 2.

Fig. 2 SCRL CA time slot structure

The \(T_{RTS}\) sub-period is used for sending Request To Send (RTS) control messages, while the \(T_{CTS}\) sub-period is used by the parent nodes to send the Clear To Send (CTS) messages. The sending of control messages is performed in a self-scheduling manner to reduce the communication overhead, balance the remaining energy and avoid collisions. In the \(T_{RTS}\) sub-period, the sender nodes schedule the sending of their RTSs, while in the \(T_{CTS}\) sub-period, the receiver nodes schedule and send the CTSs, as follows:

Based on its positive number of parent nodes (Parents Degree, PD>0), the maximum parents degree among the neighboring nodes (MaxPD), the data queue size (\(\lambda\)) and the residual energy \(E_R\), each sender node selects a random waiting time (\(RT_{WS}\)) bounded by \(T_{WS}\) within the \(T_{RTS}\) sub-period, before transmitting its RTS control message. Note that a node that has no parent, except the sink, can neither send nor receive, since it is excluded. \(T_{WS}\) is calculated by (6).

$$\begin{aligned} T_{WS} = T_{RTS}^{(P_{WS}/(S_N+1))} \end{aligned}$$
(6)

where

$$\begin{aligned} P_{WS} = \left( \frac{1}{\lambda }\right) \left( \frac{PD}{MaxPD}\right) \left( \frac{E_R}{E_{Init}}\right) \end{aligned}$$
(7)

\(S_N\) is the number of successful communications, initialized to 0, and \(E_{Init}\) is the initial energy of the node.

$$\begin{aligned} T_{RTS} = T_{CTS}= \frac{T_{B}}{2} \end{aligned}$$
(8)

Hence, (6) ensures a collision-free self-scheduling between the sender nodes within the first half of \(T_B\) (the \(T_{RTS}\) sub-period). Equation (7) gives priority to the node with less residual energy, a lower parents degree and more data messages in its queue to wait less time, in order to preserve energy and to let senders with more data messages and fewer parents send first, since nodes with a higher parents degree have more chances to relay their data messages. In addition, \(T_{WS}\) decreases gradually with the number of successful communications \(S_N\), giving priority to the sender nodes that have already succeeded so that they remain winners. Furthermore, the sender nodes choose their actions by applying (5) sequentially: the first sender node chooses its action and broadcasts its choice; the other sender nodes, as well as the other neighboring nodes, learn from this choice by adding the transmitted Q value to their Q tables; the following sender nodes are then forced to choose other actions, among the available ones, sequentially. The different steps of the \(T_{B}\) period are presented in the flow chart of Fig. 3.
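
The sender-side timing of (6)-(8) can be sketched in Java as follows; the variable names are assumptions, and the time values are assumed to be expressed in a unit where \(T_{RTS}>1\) (e.g. milliseconds) so that a smaller exponent yields a smaller waiting bound.

import java.util.concurrent.ThreadLocalRandom;

/** Sketch of the sender-side self-scheduling of Eqs. (6)-(8); names and units are assumptions. */
public class SenderSchedule {
    /** Eq. (7): priority factor of the sender. */
    static double pWs(int queueSize, int parentDegree, int maxParentDegree,
                      double residualEnergy, double initialEnergy) {
        return (1.0 / queueSize)
                * ((double) parentDegree / maxParentDegree)
                * (residualEnergy / initialEnergy);
    }

    /** Eq. (6): upper bound of the random waiting time, with T_RTS = T_CTS = T_B / 2 (Eq. (8)). */
    static double tWs(double tRts, double pWs, int successCount) {
        return Math.pow(tRts, pWs / (successCount + 1));
    }

    /** RT_WS: random waiting time drawn in [0, T_WS) before sending the RTS. */
    static double rtWs(double tWs) {
        return ThreadLocalRandom.current().nextDouble(tWs);
    }
}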

Fig. 3 Cooperation model and scheduling process in the \(T_{B}\) period

On the other side, the parent nodes receive the RTS control messages sequentially in the \(T_{RTS}\) sub-period, depending on the priority of the sender nodes. Each parent node then selects a random waiting time (\(RT_{WR}\)) between \(T_{RTS}\) and \(T_{WR}\) within the \(T_{CTS}\) sub-period before sending its response (CTS). \(T_{WR}\) is calculated by (9) as follows:

$$\begin{aligned} T_{WR} = T_{RTS}+ T_{CTS}^{(P_{WR}/(S_N+1))} \end{aligned}$$
(9)

where

$$\begin{aligned} P_{WR} = \left( \frac{(MaxPD+1)-PD}{MaxPD+1}\right) \left( 1- \frac{E_r}{2E_{Init}}\right) \end{aligned}$$
(10)

Hence, (9) ensures a collision-free self-schedule between the receiver nodes based on residual energy and parents degree. The aim is to balance the remaining energy between the neighboring nodes on one hand and to select the best path to optimize delivery delay on the other. Here too, \(T_{WR}\) decreases gradually with the number of successful communications \(S_N\), giving priority to the nodes that have already succeeded so that they remain winners. Equation (10) lets the parent with more energy and a higher parents degree respond first. The other parent nodes, as well as the other neighboring nodes, then learn from the previous responses and respond to the remaining RTS control messages sequentially. To do this, each receiver node updates its RTS queue after waiting \(RT_{WR}\), during which it listened to the other CTSs (the case marked by (*) in Fig. 3). The updating process deletes the RTSs that fall into the following cases:

  • The RTSs for which the node is a parent and which have not been answered by the previous parent nodes, but whose selected channel is already used by the previous responses,

  • The RTSs for which the node is a parent but which have already been answered by the previous parent nodes,

  • The RTSs for which the node is not a parent.

The parent then selects the RTS that corresponds to the best available response action. In this way, we ensure collision-free communication that avoids both direct and indirect collisions. Furthermore, (8) ensures a collision-free self-schedule between the receiver and the sender nodes.
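
The receiver-side timing of (9)-(10) and the RTS-queue update described above can be sketched in Java as follows; the Rts record and the exact filtering conditions are simplified assumptions.

import java.util.List;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of the receiver-side self-scheduling of Eqs. (9)-(10) and the RTS-queue update. */
public class ReceiverSchedule {
    record Rts(int senderId, int requestedChannel, boolean alreadyAnswered) {}

    /** Eq. (10): priority factor of the receiver (parent). */
    static double pWr(int parentDegree, int maxParentDegree,
                      double residualEnergy, double initialEnergy) {
        return ((double) (maxParentDegree + 1 - parentDegree) / (maxParentDegree + 1))
                * (1.0 - residualEnergy / (2.0 * initialEnergy));
    }

    /** Eq. (9): upper bound of the random waiting time, inside the T_CTS sub-period. */
    static double tWr(double tRts, double tCts, double pWr, int successCount) {
        return tRts + Math.pow(tCts, pWr / (successCount + 1));
    }

    /** RT_WR drawn in [T_RTS, T_WR) before sending the CTS. */
    static double rtWr(double tRts, double tWr) {
        return ThreadLocalRandom.current().nextDouble(tRts, tWr);
    }

    /** Keep only the RTSs this parent may still answer after overhearing earlier CTSs. */
    static void updateRtsQueue(List<Rts> queue, Set<Integer> myChildren, Set<Integer> usedChannels) {
        queue.removeIf(rts -> !myChildren.contains(rts.senderId())      // not a parent of the sender
                || rts.alreadyAnswered()                                // answered by a previous parent
                || usedChannels.contains(rts.requestedChannel()));      // channel taken by an earlier response
    }
}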


As mentioned before, we propose two algorithms: the Dynamic Self-schedule based Cooperative multi-agent Reinforcement Learning for Channel Assignment (DSCRL CA), presented in Algorithm 1, and the Static Self-schedule based Cooperative multi-agent Reinforcement Learning for Channel Assignment (SSCRL CA), presented in Algorithm 2.


In SSCRL CA, we use a sleeping probability (\(P_{SLEEP}\)) for each slot, initialized to 0 and incremented whenever there is no action to perform in the \(T_{U}\) period (line 13). After the learning period, each node checks for each slot whether its sleeping probability (\(P_{SLEEP}\)) is greater than a threshold \(\Delta\). If so, it takes the sleeping action as the slot action (lines 19, 20); otherwise, it takes the best-performing action of the learning period as the slot action (lines 22, 23). Furthermore, the data message size is increased to fill the slot, given the absence of the \(T_{B}\) period after the learning phase, since the data message size can reach 128 bytes in IEEE 802.15.4, for example [25]. In addition, new nodes can discover their parents through the beacon messages transmitted periodically in Beacon slots, and start the learning process in the sleeping slots of their parents. In order to accelerate the learning process of the new nodes, each parent keeps, for each slot, a blacklist of channels (\(Bl_{Chs}\)) created during the learning phase, which contains the channels used by the neighboring nodes in the concerned slot. During the negotiation between a new node and its parents, each parent sends the list of sleeping data time slots as well as the \(Bl_{Chs}\) of each data time slot, to be excluded from the available actions of the concerned slot.
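
A minimal Java sketch of the per-slot bookkeeping described above (the \(P_{SLEEP}\) counter and the post-learning slot decision) is given below; the names, the SLEEP encoding and the treatment of \(P_{SLEEP}\) as a raw counter are assumptions.

/** Sketch of the SSCRL CA per-slot bookkeeping and post-learning decision. */
public class SlotState {
    private double pSleep = 0;       // P_SLEEP, initialized to 0
    private String bestAction;       // best-performing action seen during learning
    private double bestQ = Double.NEGATIVE_INFINITY;

    /** Called once per learning frame for this slot (line 13: increment P_SLEEP when idle in T_U). */
    void observeFrame(boolean idleInTu, String action, double qValue) {
        if (idleInTu) pSleep++;
        if (qValue > bestQ) { bestQ = qValue; bestAction = action; }
    }

    /** After learning (lines 19-23): sleep if P_SLEEP exceeds the threshold Delta, else keep the best action. */
    String finalAction(double delta) {
        return (pSleep > delta) ? "SLEEP" : bestAction;
    }
}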

4 Simulation and Results

In this section, we present the simulation results of our protocols, compared with RL MMAC [1], which uses the same distributed tree topology in static mode, and with CMAA [23], which is characterized by its dynamicity. The protocols are implemented in Java, one of the most popular languages for building flexible, portable and high-performance network applications; several simulation tools and libraries based on Java have been developed for the simulation of wireless networks.

We use the JSensor simulator [29], since it provides parallel simulation based on the number of available processor cores. In order to implement the different protocols, we extend JSensor with the following mechanisms:

  • Time Division Multiple Access (TDMA) scheduling mechanism

  • Multi-channel mechanism: nodes can communicate only when using the same channel.

  • Energy consumption mechanism: we use the same model for calculating the communication energy dissipation as in [30]; a short code sketch of this model is given after Eq. (13). The energy spent (\(E_{Tx}\)) to transmit a K-bit packet over a range R is given by (11):

    $$\begin{aligned} E_{Tx} = E_{Elect}\times K + E_{amp}\times K\times R^2 \end{aligned}$$
    (11)

    where \(E_{Elect}\) is the energy required to activate the electronic circuits and \(E_{amp}\) is the energy required to amplify the transmitted signal so that one bit can be transmitted in open space. \(E_{Elect}\) and \(E_{amp}\) are fixed at \(5\times 10^{-8}\) and \(3\times 10^{-8}\), respectively.

    Energy consumption to receive a packet of K bits (\(E_{Rx}\)) is calculated according to (12):

    $$\begin{aligned} E_{Rx} = E_{Elect}\times K \end{aligned}$$
    (12)

    Energy consumption in idle time T (ms) (\(E_{Ix}\)) is calculated according to (13):

    $$\begin{aligned} E_{Ix} = E_{Elect}\times T \end{aligned}$$
    (13)
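
For completeness, the energy model of (11)-(13) can be written as the following Java sketch, with \(E_{Elect}\) and \(E_{amp}\) set to the values stated above; the units follow the equations as given.

/** Sketch of the energy model of Eqs. (11)-(13) used in the simulations. */
public final class EnergyModel {
    static final double E_ELECT = 5e-8;   // energy to drive the electronics, per bit
    static final double E_AMP   = 3e-8;   // amplifier energy per bit per m^2

    /** Eq. (11): energy to transmit K bits over range R (in meters). */
    static double tx(int kBits, double rangeMeters) {
        return E_ELECT * kBits + E_AMP * kBits * rangeMeters * rangeMeters;
    }

    /** Eq. (12): energy to receive K bits. */
    static double rx(int kBits) {
        return E_ELECT * kBits;
    }

    /** Eq. (13): energy spent idling for T milliseconds. */
    static double idle(double tMillis) {
        return E_ELECT * tMillis;
    }
}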

An example of the implemented schedule-based process during the SSCRL CA learning period is shown in Fig. 4. For presentation purposes, we use only 5 frames with 4 time slots.

Fig. 4 Run result showing the schedule-based process in the SSCRL CA learning period

We run simulation experiments with 100 nodes placed randomly in a 500 m × 500 m area. The sink is placed in the middle and the transmission range is fixed at 50 m. The leaf nodes generate data messages every 10 s. The results are averaged over 10 simulation runs.

Two scenarios are considered: the comparison of the learning period between RL MMAC and SSCRL CA, and the comparison of network lifetime and performance over a long run of 300 minutes between the three protocols CMAA, SSCRL CA and DSCRL CA.

For the first comparison, shown in Fig. 5, six metrics are measured while varying the number of channels:

  • End to end packets delivery ratio,

  • Total energy consumed ratio,

  • Total energy consumed ratio of overhead,

  • Number of colliding packets,

  • End to end latency,

  • End to end hops rate.

To evaluate the impact of the number of time slots per frame, we use the parameter \(\alpha\), defined as the ratio of the number of time slots per frame to the number of data messages periodically generated by the leaf nodes. The slot length differs between the two protocols: 40 ms in RL MMAC and 60 ms in SSCRL CA. The aim is to keep the same 64-byte packet length transmitted in 40 ms and to use the first 20 ms of the SSCRL CA slot for the \(T_B\) period. As mentioned in [1], the length of the learning period depends on the longest path to the sink node; we use 15 frames, which is more than enough.

Fig. 5 Comparison of (a) end-to-end packet delivery ratio, (b) total energy consumption ratio, (c) total overhead energy consumption ratio and (d) collided packets per number of channels in the learning period

Figure 5 shows that SSCRL CA reaches full end-to-end packet delivery in all 10 experiments at \(\alpha\)=1.5, while RL MMAC succeeds in all experiments only with more than 2 channels, and in only two experiments with 2 channels (Fig. 5a). Furthermore, SSCRL CA improves packet delivery by more than 50\(\%\) for the other values of \(\alpha\). The energy consumed in the learning phase is greatly reduced by SSCRL CA in all situations (Fig. 5b); this reduction reaches 80\(\%\) of the energy consumed by RL MMAC. This is explained by the high communication overhead generated by RL MMAC, which broadcasts data packets instead of the control packets used in SSCRL CA (Fig. 5c), as well as by the significant reduction of collisions achieved by SSCRL CA, which reaches 78\(\%\) thanks to the self-scheduling mechanism (Fig. 5d).

In Fig. 6, we take the values of \(\alpha\) for which full end-to-end packet delivery is reached, and investigate the latency and the average number of hops taken by all the data messages from the source nodes to the sink in the learning phase of the two protocols. Note that in Fig. 6a we only take \(\alpha\)=1.5 for RL MMAC, since it achieves full delivery at this value. The latency of SSCRL CA is better than that of RL MMAC when the frame periods are identical, since the frame period of SSCRL CA at \(\alpha\)=1 equals that of RL MMAC at \(\alpha\)=1.5. However, for the same value of \(\alpha\), RL MMAC performs better in latency than SSCRL CA, since the time slot in RL MMAC is 33\(\%\) shorter than that of SSCRL CA. Figure 6b shows the path length rate taken by all the data messages to the sink node for the two protocols at the two values of \(\alpha\) (1 and 1.5). SSCRL CA continually reduces the path length as the number of used channels grows, by a rate that reaches 36.48\(\%\) with 8 channels at \(\alpha\)=1.5. This can be explained by the default-channel mechanism used by RL MMAC, since following the default channels can result in very long paths, while SSCRL CA gives priority to building the shortest possible paths to the sink thanks to its dynamic channel-selection mechanism.

Fig. 6 Comparison of (a) total end-to-end delivery and (b) hops rate per number of channels in the learning period

For the second comparison, shown in Figs. 7, 8 and 9, we set \(\alpha\) to 1.5 and investigate the network energy consumption and delivery of the three protocols CMAA, SSCRL CA and DSCRL CA over a long run of 300 minutes. Four metrics are measured while varying the number of channels:

Fig. 7 Comparison of (a) dead nodes ratio and (b) first dead node per number of channels over 300 minutes of communication

  • Dead nodes,

  • The first dead node,

  • Delivery ratio in bit per second,

  • Total energy consumption ratio of overhead.

Figure 7 shows the dead nodes after 300 minutes of operation with the three protocols. DSCRL CA is the best one when few channels are used (2 or 4): it has no dead node with 2 channels and the lowest dead-node ratio with 4 channels. However, unlike the other two protocols, the number of dead nodes in DSCRL CA increases with the number of used channels, which makes SSCRL CA the best one with more than 4 channels. This is due to the communication overhead in the \(T_B\) period, which grows with the number of used channels and leads to the supplementary energy consumption shown in Fig. 9, outweighing the energy-balancing mechanism in the 6- and 8-channel cases. The delivery ratio in Fig. 8 is improved, especially by SSCRL CA, in a manner that increases with the number of used channels: the improvement over the CMAA delivery ratio grows from 6.07\(\%\) with 2 channels to 60.42\(\%\) with 8 channels. The reason is the avoidance of communication overhead as well as the increase of the packet length after the learning period, which is performed more successfully than in CMAA, as explained above. In DSCRL CA, however, the delivery ratio increases more slowly, from 10.18\(\%\) over CMAA with 2 channels to 15.02\(\%\) with 8 channels, because of the cost of the supplementary overhead shown in Fig. 9, which nevertheless remains an improvement over CMAA thanks to the self-scheduling mechanism.

Fig. 8 Comparison of the delivery ratio over 300 minutes of communication

Fig. 9 Comparison of the total overhead energy consumption over 300 minutes of communication

5 Conclusion

In this paper, we propose a distributed cooperative multi-agent reinforcement learning approach based on a self-schedule scheme for channel assignment in schedule-based wireless sensor networks. The main challenge of such networks is energy consumption due to communication overhead, collisions and unbalanced energy consumption, which drastically reduce the network lifetime. For this reason, special attention is given to reducing both the communication overhead and collisions and to balancing the energy consumption as much as possible, based on the self-schedule scheme, which also plays an important role in accelerating the RL iterations. We investigate our approach through two protocols, a static one (SSCRL CA) and a dynamic one (DSCRL CA). The proposed protocols enable nodes to implicitly schedule and adapt their energy consumption efficiently based on channel availability. They involve distributed channel selection in a self-schedule manner based on routing metrics, and they take energy as the base metric by enabling nodes to reduce communication overhead and collisions, but also to balance energy consumption among themselves, in order to improve the network lifetime in addition to throughput and latency in both static and dynamic fashions. Through simulations, we demonstrate the effectiveness of our approach with the two protocols. Our approach significantly reduces energy consumption and improves the network lifetime with good packet delivery ratios for both protocols when a small number of channels is used; however, the performance of the dynamic protocol (DSCRL CA) deteriorates compared to that of the static one (SSCRL CA) when a high number of channels is used. As future work, we propose a hybrid method that uses the dynamic method with a small number of channels and switches to the static method with a high number of channels. Furthermore, it would be worthwhile to implement our protocols in real-life applications to perform real-world evaluations [8].