1 Introduction

A wireless sensor network (WSN) [1] is comprised of a large number of sensor nodes and sink nodes that monitor events or environmental parameters, such as temperature and humidity, in a collaborative manner. The sensor nodes collect data and send it to destination sink nodes over single or multiple hops, while the sink nodes process the data in order to provide meaningful information to end users. WSNs have numerous potential applications, including in the medical field, disaster recovery and wildlife monitoring.

Generally speaking, each sensor node operates on battery power. Two main factors affect energy consumption. Firstly, the state of the transceiver: energy consumption is high during transmission, reception and idle listening (or overhearing), and low during sleep. Secondly, events other than successful packet transmission, including collisions, retransmissions and control packet transmissions, incur additional energy consumption. Enhancing energy efficiency to prolong network lifetime without jeopardizing network performance has attracted considerable research attention, and has been part of the objective of most schemes in WSNs because sensor nodes may be deployed in hard-to-reach areas.

In recent years, there has been increasing interest in applying an artificial intelligence approach called reinforcement learning (RL) [2] to various schemes in WSNs in order to improve network performance. The RL approach adopts an unsupervised and online learning technique. Through unsupervised learning, no external teacher or critic is required to oversee the learning process, and so a decision maker (or agent) must make its own efforts to acquire knowledge about the operating environment. Through online learning, an agent acquires knowledge on the fly while carrying out its normal operation, and so empirical data or experimental results from the laboratory are not required. A wide range of schemes can be represented using RL models, and subsequently various aspects of network performance can be improved using RL algorithms.

Although extensive research has been carried out on a wide range of schemes in WSNs, no single study adequately covers the distinctive RL models and algorithms that have been applied, and so this is the focus of this article. The rest of this article is organized as follows. Sections 1.1 and 1.4 present an overview of RL and the application schemes of WSNs, respectively. In the context of WSNs, Sect. 2 presents various components, features and enhancements of RL, while Sect. 3 presents various RL models and algorithms. Section 4 presents the performance enhancements brought about by RL in various schemes. Section 5 presents open issues, and finally Sect. 6 presents conclusions.

1.1 Overview of reinforcement learning

This section presents an overview of the RL model and Q-learning.

1.2 RL model

Figure 1 presents a simplified version of an RL model. The purpose of RL is to estimate the long-term reward of each state-action pair through trial-and-error interactions with the operating environment. There are three main representations. Firstly, the state represents the decision-making factors (or the operating environment) under consideration being observed by an agent. Examples are residual energy and the number of packets in the buffer queue. Secondly, the action represents the choice selected by the agent, which may change or affect the state and reward. Examples are selecting a transmission power and selecting a next-hop node for packet transmission. Thirdly, the reward represents the gains or losses in network performance for taking an action in a particular state at the previous time instant. Examples are throughput and energy consumption level.

Fig. 1: A simplified RL model

At any time instant, an agent observes the state and reward from its operating environment, learns the long-term reward of each state-action pair, and decides on and carries out an appropriate action on the environment so that the state and reward, which are the consequences of the action, improve in the next time instant. The agent interacts with the operating environment in a trial-and-error manner, and so, given a particular state, the agent learns to carry out the optimal action as time progresses in order to improve the next state and reward.

The RL model in Fig. 1 can be embedded in each sensor node [3], or applied to the surrounding area of a sensor node [4]. For instance, each sensor node keeps track of the reward with regard to each neighboring sensor node in [5], and for each grid point in its surrounding operating environment in [4]. As an example of the application of RL in WSNs, consider routing (see Sect. 1.4), where RL is used to learn the optimal route. The state represents a destination (or sink) node, the action represents the selection of a next-hop node to forward packets, and the reward represents the progress, in terms of physical distance, towards the sink node. Maximizing the reward reduces the distance to the sink node, which enhances network performance.

There are two main advantages of RL. Firstly, it models overall network performance, which covers most of the factors affecting the performance, rather than modeling each factor individually, and so this simplifies the design. Secondly, it learns on the fly during normal operation, and so it does not require prior knowledge of the operating environment. For instance, a sleep-wake scheduler aims to reduce energy consumption through sleeping for the right duration at the right time; the traffic loads at neighboring nodes are pertinent to determining this duration, although this information may not be known to the scheduler.

1.3 Q-learning

Q-learning [6] is a popular technique in RL. Denoting the decision epochs by \(t\in T=\{1,2,\ldots \}\), each agent \(i\) updates the Q-function \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) of a particular state-action pair at time \(t\) as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })\leftarrow ( {1-\alpha })Q_t^i ( {s_t^i ,a_t^i })+\alpha \left[ {r_{t+1}^i ( {s_{t+1}^i })+\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})} \right] \end{aligned}$$
(1)

where \(s_t^i \in S\) is state, \(a_t^i \in A\) is action, \(r_{t+1}^i ( {s_{t+1}^i })\in R\) is delayed reward, \(0\le \gamma \le 1\) is discount factor, and \(0\le \alpha \le 1\) is learning rate. Note that, the delayed reward \(r_{t+1}^i ( {s_{t+1}^i })\) for action selection at time \(t\) is dependent on the state at time \(t+1\) and so it is received at time \(t+1\). Also note that, higher \(\gamma \) value causes greater dependency on the discounted future reward \(\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\) rather than the delayed reward \(r_{t+1}^i (s_{t+1}^i )\); while higher \(\alpha \) value causes greater dependency on the delayed reward \(r_{t+1}^i ( {s_{t+1}^i })\) and the discounted future reward \(\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\) rather than the Q-value \(Q_t^i ( {s_t^i ,a_t^i })\) at time \(t\).

An agent \(i\) observes state \(s_t^i \) from the operating environment and chooses an action \(a_t^i \) at decision epoch \(t\). The state \(s_t^i \) changes to \(s_{t+1}^i \) at decision epoch \(t+1\). Subsequently, the agent receives delayed reward \(r_{t+1}^i ( {s_{t+1}^i })\) and updates the Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) using Eq. (1). The Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) is updated using the maximum discounted future reward \(\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\), assuming the agent takes the optimal action in future states at times \(t+1,t+2,\ldots \). As time progresses, the agent receives a sequence of rewards, which contribute to the convergence of the Q-values to the long-term rewards. The agent chooses an optimal action through maximizing the value function \(V^\pi (s_t^i )\) as shown below:

$$\begin{aligned} V^\pi ( {s_t^i })=\max \nolimits _{a\in A} (Q_t^i ( {s_t^i ,a})) \end{aligned}$$
(2)

Hence, agent \(i\)’s policy is as follows:

$$\begin{aligned} \pi _i ( {s_t^i })=\text{ argmax }_{a\in A} (Q_t^i ( {s_t^i ,a})) \end{aligned}$$
(3)

In some cases, a negative reward represents a cost, which must be minimized, and so \(V^\pi ( {s_t^i })=\min _{a\in A} (Q_t^i ( {s_t^i ,a}))\) and \(\pi _i ( {s_t^i })=\text{ argmin }_{a\in A} (Q_t^i ( {s_t^i ,a}))\). Note that choosing the optimal action using Eq. (3) at all times does not update the Q-values of the other actions, which may cause the agent to converge to locally optimal solutions. Hence, there are two methods for action selection. Exploitation chooses the best-known optimal action for performance enhancement, while exploration chooses the other actions once in a while to update their Q-values so that better actions may be discovered.
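
To make Eqs. (1)–(3) concrete, the following is a minimal tabular Q-learning sketch in Python with \(\varepsilon \)-greedy action selection. The class name, the default parameter values and the dictionary-based Q-table are illustrative assumptions rather than part of any cited scheme.

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Minimal tabular Q-learning agent implementing Eqs. (1)-(3)."""

    def __init__(self, actions, alpha=0.5, gamma=0.8, epsilon=0.1):
        self.actions = list(actions)   # action set A
        self.alpha = alpha             # learning rate
        self.gamma = gamma             # discount factor
        self.epsilon = epsilon         # exploration probability
        self.q = defaultdict(float)    # Q(s, a), initialized to zero

    def update(self, s, a, reward, s_next):
        """Eq. (1): blend the old Q-value with the delayed reward plus
        the discounted maximum future reward."""
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] = ((1 - self.alpha) * self.q[(s, a)]
                          + self.alpha * (reward + self.gamma * best_next))

    def value(self, s):
        """Eq. (2): value function V(s) = max_a Q(s, a)."""
        return max(self.q[(s, a)] for a in self.actions)

    def choose_action(self, s):
        """Eq. (3) with epsilon-greedy selection: explore with a small
        probability, otherwise exploit the best-known action."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)                  # exploration
        return max(self.actions, key=lambda a: self.q[(s, a)])  # exploitation
```

For a routing agent, for example, the actions could be the neighbor identifiers and the state the destination sink node, with the table updated after each acknowledged transmission.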

1.4 Application schemes of wireless sensor networks

Reinforcement learning is a versatile approach to many of the problems and open issues associated with the dynamicity and uncertainty of the operating environment. RL has been applied in various schemes in WSNs as follows:

A.1 Medium access control (MAC). MAC protocols coordinate channel access among multiple nodes in single-hop transmissions to reduce collisions. The two main functions are the sleep-wake scheduler [7–9] and the transceiver selector [10], as follows:

A.1.1 Sleep-wake scheduler arranges the transmission, reception, idle and sleeping time durations. During idle mode, sensor nodes listen for potential packet transmissions, and the energy consumption is almost identical to that of the receive mode. To reduce energy consumption, a sleep-wake scheduler schedules the sleeping and waking (i.e. transmission, reception and idle) time durations. There are two main purposes in sleep-wake scheduling. Firstly, a longer waking time duration (or higher duty cycle) increases bandwidth availability, leading to higher throughput and lower packet latency; however, it also increases energy consumption. The waking time duration may increase with network traffic load [8, 9] or Quality of Service (QoS) requirements [11]. RL has been applied to minimize collisions and energy consumption in slot assignment [7], as well as to estimate traffic arrivals from neighboring nodes in order to adjust the sleeping and waking time durations [8, 9]. Secondly, a mobile data collector node moves within an area to collect sensing outcomes from static sensor nodes [12]. RL has been applied in each sensor node to learn the waking time duration based on the arrival pattern of the mobile data collector node [12] in order to increase the in-contact time with the mobile data collector node while reducing energy consumption.

A.1.2 Transceiver selector selects either a long-range or a short-range radio for data and control packet transmissions. The long-range (short-range) radio uses higher (lower) transmission power. To reduce energy consumption, a transceiver selector switches between the transceivers based on physical range (e.g. whenever a mobile node moves from one effective transmission range to another) and channel conditions (e.g. fading, interference, shadowing and multi-path effects) [10].

A.2 Cooperative communications select cooperative nodes to forward packets towards sink nodes in order to ameliorate the effects of deteriorating channel conditions and changes in network topology. For instance, with reference to Fig. 2, suppose the direct transmission \(i\rightarrow j\) (from node \(i\) to forwarding node \(j)\) is unsuccessful. Any packet retransmission through the direct transmission \(i\rightarrow j\) may still be unsuccessful if the channel experiences deep fading for a long period of time. Since node \(k\) overhears the packet, cooperative communication enables the indirect transmission \(i\rightarrow k\rightarrow j\). Since there may be a number of potential cooperative nodes \(k\in K\), RL has been applied at node \(i\) to select a cooperative node [13], thereby providing spatial and time diversity gains.

Fig. 2: Cooperative communications

A.3 Routing enables a sensor node to search for the best route to a sink node in clustered [14] and non-clustered networks [5]. Generally speaking, clustering segregates the entire network into groups, each consisting of a clusterhead and member nodes. The clusterhead collects, processes and aggregates the sensing outcomes received from member nodes, and subsequently sends them to the sink node over single or multiple hops. RL has been applied in each sensor node to learn the best route to the sink node.

A.4 Rate control adjusts the packet transmission rate of a source node, and hence the congestion level of intermediate nodes, along a route [15, 16].

A.5 Sensing coverage is a WSN application that maximizes the physical sensing coverage of an area so that any event of interest is accurately detected by at least a single sensor node. Sensing coverage can be applied in surveillance and monitoring tasks (e.g. intruder and fire detection). To reduce energy consumption, RL has been applied in each sensor node to minimize the overlapping of sensing coverage [17].

A.6 Task scheduling schedules and carries out the right task at different time instants. For instance, in [18], RL has been applied in each sensor node to learn the usefulness of each task (i.e. sensing, transmitting, receiving, aggregating data, and sleeping) at different time instants in order to reduce energy consumption.

2 Reinforcement learning: components, features and enhancements

This section presents the traditional and enhanced components and features of RL in the context of WSNs. For each component and feature, we show the traditional approach and subsequently the alternative or enhanced approaches.

2.1 State

Traditionally, each state is comprised of a single type of information. For instance, each state \(s_t^i \in S=\left\{ {1,2,\ldots ,K} \right\} \) represents the number of packets in the buffer queue [8]. The state representation can be enhanced in two ways. Firstly, the state may not be represented at all because there is only a single state, and this is called stateless [7]. Secondly, each state may be comprised of distinctive substates. For instance, state \({\mathbf{s}}_{\mathbf{t}}^{\mathbf{i}} =( {s_{x,t}^i ,s_{y,t}^i })\in S\), where \(s_{x,t}^i \in S_x \) and \(s_{y,t}^i \in S_y \) represent a set of potential neighboring nodes and data flows, respectively [32].

The state representation can be further enhanced by reducing the state space based on the Hamming distance. In general, a larger state space increases the memory requirement and reduces the convergence rate to optimal actions, since an agent must explore more state-action pairs. The number of states can be reduced based on the Hamming distance [12]. An agent calculates the weighted Hamming distance between two states, specifically \(H( {s_1 -s_2 })=W_1 \cdot \left| {V_1 ( {s_1 })-V_1 ( {s_2 })} \right| +W_2 \cdot \left| {V_2 ( {s_1 })-V_2 ( {s_2 })} \right| +\cdots +W_N \cdot \vert V_N ( {s_1 })-V_N ( {s_2 })\vert \), where the weight \(W_n \) represents the significance of the corresponding term \(\vert V_n ( {s_1 })-V_n ( {s_2 })\vert \) in differentiating the two states. Both states \(s_1 \) and \(s_2 \) share a single entry in the Q-table if their Hamming distance is less than a threshold, i.e. \(H( {s_1 -s_2 })<H_T \).
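
As an illustration, the sketch below shows how Hamming-distance-based state aggregation could be implemented; the state encoding (a tuple of variable values), the weights and the threshold are hypothetical.

```python
def weighted_hamming(s1, s2, weights):
    """Weighted Hamming distance H(s1 - s2), where each state is a
    tuple of variable values (V_1, ..., V_N)."""
    return sum(w * abs(v1 - v2) for w, v1, v2 in zip(weights, s1, s2))

def q_table_key(state, representatives, weights, threshold):
    """Map a state onto an existing Q-table entry if it lies within the
    Hamming-distance threshold of a known representative state;
    otherwise register it as a new representative. This bounds the
    number of Q-table entries."""
    for rep in representatives:
        if weighted_hamming(state, rep, weights) < threshold:
            return rep                 # share the existing Q-table entry
    representatives.append(state)      # new representative state
    return state
```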

2.2 Action

Traditionally, each action represents a single action out of a set of possible actions. For instance, in a routing scheme A(3) [5, 14], each action \(a_t^i \in A=\{1,2,\ldots ,K\}\) represents a next-hop node for packet transmission, while \(A\) represents the set of all neighbor nodes. The action representation can be enhanced in two ways. Firstly, each action may be represented by subactions. For instance, in [7, 13], action \({\mathbf{a}}_{\mathbf{t}}^{\mathbf{i}} =( {a_{1,t}^i ,a_{2,t}^i ,\ldots ,a_{K,t}^i })\in A_1 \times A_2 \times \cdots \times A_K \), where \(a_{k,t}^i \in A_k =\{0,1\}\). Secondly, the subactions may be of distinctive types. For instance, in [32], action \({\mathbf{a}}_{\mathbf{t}}^{\mathbf{i}} =( {a_{x,t}^i ,a_{y,t}^i })\in A\), where \(a_{x,t}^i \in A_x =\{0,1\}\) and \(a_{y,t}^i \in A_y =\{a_{y,1,t}^i ,a_{y,2,t}^i ,\ldots ,a_{y,K,t}^i \}\).

2.3 Delayed reward

Traditionally, each delayed reward represents the performance enhancement achieved by a state-action pair, and a single reward computation approach is applicable to all state-action pairs. For example, in a sleep-wake scheduler A(1.1) [8], the delayed reward \(r_{t+1}^i ( {a_{t+1}^i })\) is the ratio of the effective transmission and reception time durations to the waking time duration. Additionally, the delayed reward can be a constant value, such as \(r_{t+1}^i ( {a_{t+1}^i })=1\) or \(-1\) to indicate successful and unsuccessful transmissions, respectively [7]. The delayed reward can be further enhanced in the context of WSNs as described next.

2.3.1 Distinctive reward functions

Different reward functions can be used to compute rewards under distinctive network conditions [5, 10].

As an example, in a transceiver selector A(1.2) [10], the action is to select a transceiver and its transmission power level for packet transmissions. The cost (or negative reward) for each packet transmission depends on the amount of energy consumption, and it is a function of the number of retransmissions, the MAC delays (i.e. channel sensing and backoff), the transmission and reception power levels, as well as the packet size. A higher cost indicates higher energy consumption. However, there is one condition under which the reward computation is different. Specifically, if transmissions are unsuccessful even though the highest transmission power has been used, a zero reward value is assigned in order to prevent the agent from exploring other actions with lower transmission power levels until the transmissions are successful at the highest transmission power level.
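
A sketch of this conditional cost computation is shown below; the energy model and its arguments are simplified assumptions and do not reproduce the exact formulation in [10].

```python
def transmission_cost(success, tx_power, max_tx_power,
                      retransmissions, mac_delay, rx_power, packet_size):
    """Return the negative reward (cost) for one packet transmission.

    If the transmission fails even at the highest transmission power,
    a zero reward is returned so that the agent is not pushed towards
    exploring lower power levels that would also fail."""
    if not success and tx_power >= max_tx_power:
        return 0.0
    # Otherwise the cost reflects the energy consumed: retransmissions,
    # MAC delays (channel sensing and backoff), transmit/receive power
    # levels and packet size all contribute.
    energy = (retransmissions + 1) * (tx_power + rx_power) * packet_size
    energy += mac_delay * rx_power
    return -energy
```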

2.3.2 Average delayed reward function

Traditionally, the delayed reward is an instantaneous value. The application of an average delayed reward has been shown to improve the overall system performance [19], and it has been applied in [16, 20].

As an example, in [16], the average delayed reward is as follows:

$$\begin{aligned} r_{a,t+1}^i ( {s_{t+1}^i })&\leftarrow r_{a,t}^i ( {s_{t+1}^i })+\alpha _r \Big [ r_t^i ( {s_{t+1}^i })-r_{a,t}^i ( {s_{t+1}^i })\nonumber \\&\qquad +\max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})-\max \nolimits _{a\in A} Q_t^i ( {s_t^i ,a}) \Big ] \end{aligned}$$
(4)

where \(r_{a,t}^i ( {s_{t+1}^i })\) represents the average delayed reward, and \(\alpha _r \) represents the learning rate of the average delayed reward computation. The Q-function (1) is rewritten to incorporate the average delayed reward as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })&\leftarrow ( {1-\alpha })Q_t^i ( {s_t^i ,a_t^i })\nonumber \\&\qquad +\,\alpha \left[ {r_{t+1}^i ( {s_{t+1}^i })-r_{a,t}^i ( {s_{t+1}^i })+\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})} \right] \end{aligned}$$
(5)

In [16], the average delayed reward approach is applied in congestion avoidance A(4) to adjust the packet transmission rate of a source node in order to adjust the congestion level. The state represents the number of packets in the buffer queue; the action selects a next-hop node and a packet transmission rate; and the delayed reward is a function of energy efficiency and packet loss rate.

As another example, in [20], the average delayed reward is as follows:

$$\begin{aligned} r_{a,t+1}^i ( {s_{t+1}^i })\leftarrow (1-\alpha _r )\cdot r_{a,t}^i ( {s_{t+1}^i })+\alpha _r \cdot r_t^i ( {s_{t+1}^i }) \end{aligned}$$
(6)

The Q-function (1) is rewritten to incorporate the average delayed reward as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })&\leftarrow ( {1-\alpha })Q_t^i ( {s_t^i ,a_t^i })\nonumber \\&\qquad +\,\alpha \left[ {r_{t+1}^i ( {s_{t+1}^i })-\gamma (r_{a,t}^i ( {s_{t+1}^i })+\max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a}))} \right] \quad \quad \end{aligned}$$
(7)

In [20], the average delayed reward approach is applied in a sleep-wake scheduler A(1.1) to adjust the waking time duration (or duty cycle), transmission power and modulation levels in order to reduce energy consumption. The state represents the channel gain and the number of packets in the buffer queue; the action selects the waking time duration, as well as the transmission power and modulation levels; and the reward is the ratio of the number of received and transmitted packets to the energy consumption and the processing cost in the buffer queue.
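
The two variants above differ only in how the running average enters the Q-update. A hedged sketch of the simpler exponential-average form of Eqs. (6) and (7) follows; the dictionary-based tables and default parameter values are illustrative.

```python
def update_with_average_reward(q, r_avg, s, a, reward, s_next, actions,
                               alpha=0.5, alpha_r=0.1, gamma=0.8):
    """One update step combining Eq. (6) and Eq. (7).

    q     : dict mapping (state, action) -> Q-value
    r_avg : dict mapping state -> running average delayed reward
    """
    # Eq. (6): exponential moving average of the delayed reward.
    r_avg[s_next] = (1 - alpha_r) * r_avg.get(s_next, 0.0) + alpha_r * reward

    # Eq. (7): Q-update that discounts the average reward together with
    # the maximum future Q-value.
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    q[(s, a)] = ((1 - alpha) * q.get((s, a), 0.0)
                 + alpha * (reward - gamma * (r_avg[s_next] + best_next)))
```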

2.4 Discounted reward

Traditionally, the discounted reward has been applied to indicate the dependency of the Q-function on future rewards. As an example, in a routing scheme A(3) [5], the delayed reward represents the link cost from a node to a next-hop node, while the discounted reward \(\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\) represents the route cost from the next-hop node to a sink node, which may be multiple hops away. The discounted reward may be omitted with \(\gamma =0\) to show the lack of dependency on future rewards; this is generally called the myopic approach, which enables an agent to adapt to instantaneous changes in the operating environment [21], and further discussion on this approach is presented in Sect. 3.1. The discounted reward can be further enhanced using an average discounted reward function. Generally speaking, the future reward may be uncertain in some cases, and so an agent may be uncertain about its action selection. In [22], the average discounted Q-value is computed over all possible actions, and hence \(\max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\) in Eq. (1) is replaced by \(\mathop \sum _{a\in A} [P(a)\times Q_t^i ( {s_{t+1}^i ,a})]/\mathop \sum _{a\in A} P(a)\). Note that, if all possible actions are taken into account, then \(\mathop \sum _{a\in A} P( a)=1\).

2.5 Q-function

The traditional Q-function (see Eq. (1)) can be further enhanced.

2.5.1 Q-value initialization

Generally speaking, the Q-values are initialized to a certain value (e.g. a zero value) so that all possible actions are given a fair chance during exploration. However, they can be initialized with different values to speed up the convergence rate. For instance, in a cooperative communication scheme A(2) [13], the Q-values are initialized based on the distance between a node \(i\) and its next-hop node \(j\), in which higher Q-values indicate more favorable nodes for making progress, in terms of physical distance, towards a sink node in order to reduce end-to-end delay.

2.5.2 Reward equivalent Q-function

In [23, 24], the learning rate and discount factor are set to \(\alpha =1\) and \(\gamma =0\), so the Q-function equals the delayed reward, \(Q_{t+1}^i ( {a_t^i })=r_{t+1}^i ( {a_t^i })\), and it is applied to speed up the learning process in a routing scheme A(3). In [23], a node \(i\) selects its next-hop node \(a_t^i \) and updates its Q-value \(Q_{t+1}^i ( {a_t^i })=r_{t+1}^i ( {a_t^i })=c_{a_t^i } +\text{ min }_a Q_t^{a_t^i } ( a)\), where \(c_{a_t^i } \) represents the link cost between node \(i\) and its next-hop node \(a_t^i \), and \(\text{ min }_a Q_t^{a_t^i } ( a)\) indicates that node \(a_t^i \) chooses its next-hop node with the minimum Q-value. Note that nodes must exchange Q-values, which indicate the route cost to a destination sink node, among themselves.
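
The reward-equivalent update in [23] is essentially a distributed, Bellman–Ford-style propagation of route costs; the sketch below illustrates the idea, with the dictionaries holding link costs and the neighbors' advertised costs as hypothetical placeholders.

```python
def update_route_costs(q_i, advertised_cost, link_cost):
    """Reward-equivalent Q-function (alpha = 1, gamma = 0):
    Q_i(a) = c_a + min_b Q_a(b) for each candidate next hop a.

    q_i             : dict mapping next-hop id -> node i's route cost
    advertised_cost : dict mapping next-hop id -> that neighbor's minimum
                      Q-value towards the sink (exchanged via messages)
    link_cost       : dict mapping next-hop id -> link cost c_a
    """
    for nxt in advertised_cost:
        q_i[nxt] = link_cost[nxt] + advertised_cost[nxt]
    # The next hop actually chosen is the one with the minimum route cost.
    return min(q_i, key=q_i.get)
```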

2.6 Exploration and exploitation

Traditionally, there are two popular approaches to achieving a balanced trade-off between exploration and exploitation, namely softmax and \(\varepsilon \)-greedy [2], which have been applied in [4, 5, 8, 14, 23, 25]. For instance, in [8], an agent chooses exploration actions with a small probability \(\varepsilon \) and exploitation actions with probability \(1-\varepsilon \). In [21], during initial exploration, an agent explores all the available actions in a round-robin manner in order to discover the Q-values of all actions. The exploration and exploitation mechanism can be further enhanced through adjusting the exploration probability.

The exploration probability may be adjusted based on the levels of uncertainty and dynamicity of the operating environment due to nodal mobility and varying channel conditions. As an example, in [26], using the \(\varepsilon \)-greedy approach, node \(i\) adjusts its exploration probability \(\varepsilon _t^i =n_{a+d,T}^i /n_T^i \), where \(n_{a+d,T}^i \) represents the number of nodes that appear and disappear in node \(i\)’s transmission range within a time window \(T\), and \(n_T^i \) represents the number of node \(i\)’s neighboring nodes. As another example, in [12], node \(i\) adjusts its exploration probability \(\varepsilon _t^i =\varepsilon _{min} +\text{ max }[0,( {\varepsilon _{max} -\varepsilon _{min} })\times ( {e_{max} -e})/e_{max} ]\), where \(e\) represents the number of events of interest; a lower \(e\) value increases the exploration probability \(\varepsilon _t^i \).

The exploration probability may also be adjusted based on action selection. In [24], using the \(\varepsilon \)-greedy approach, node \(i\) adjusts its exploration probability as follows:

$$\begin{aligned} \varepsilon _{t+1}^i =\left\{ {{\begin{array}{ll} {\varepsilon _t^i +\varepsilon _{step} ,} &{}\quad {if\ a_t^i \ne a_{t-1}^i } \\ {\varepsilon _t^i -\varepsilon _{step} ,} &{}\quad {\text{ otherwise }} \\ \end{array} }} \right. \end{aligned}$$
(8)

Note that \(\varepsilon _{t+1}^i =\varepsilon _t^i +\varepsilon _{step} \) helps to discover the optimal actions when the operating environment becomes unstable (i.e. when consecutive actions change, or \(a_t^i \ne a_{t-1}^i )\), while \(\varepsilon _{t+1}^i =\varepsilon _t^i -\varepsilon _{step} \) helps to settle on the optimal actions when the operating environment becomes stable.
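
A minimal sketch of the adaptive exploration rule of Eq. (8) is given below; the step size and the clamping bounds that keep \(\varepsilon \) a valid probability are assumptions.

```python
def adjust_epsilon(epsilon, action, prev_action,
                   step=0.05, eps_min=0.01, eps_max=0.5):
    """Eq. (8): raise the exploration probability when consecutive
    actions differ (unstable environment), lower it otherwise."""
    if action != prev_action:
        epsilon += step    # environment unstable: explore more
    else:
        epsilon -= step    # environment stable: exploit more
    return max(eps_min, min(eps_max, epsilon))
```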

3 Reinforcement learning: algorithms

The traditional RL approach (see Sect. 1.1) has been applied in various schemes to provide performance enhancement in WSNs, as shown in Table 1.

A major contribution of this section is the discussion on a number of new additions and enhancements to the traditional RL algorithms, which have been applied to various schemes in WSNs. A summary of the new RL models and algorithms is shown in Table 2. The following subsections describe the model and algorithm, including the purpose(s) of the scheme(s), followed by its associated RL model (i.e. state, action and reward representations), and finally the algorithm.

Table 1 RL models with direct application of the traditional RL approach for various schemes in WSNs
Table 2 Summary of RL models and algorithms for various schemes in WSNs

3.1 Algorithm 1: myopic RL model with \(\gamma =0\)

The myopic RL model sets the discount factor to zero, i.e. \(\gamma =0\), so that there is no dependency on future rewards; it has been applied in MAC protocols A(1) [7, 29, 30] and clustering A(3) [14]. Table 3 presents the RL algorithm for the myopic RL model.

Table 3 Myopic RL algorithm with discount factor \(\gamma =0\)

3.1.1 Chu’s slot assignment scheme for MAC protocol

Chu et al. [7] propose a slot assignment scheme for MAC protocol A(1) using the myopic RL model (see Table 3), and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to select a time slot within a time frame for data transmission in order to minimize collisions in a time-slotted MAC protocol.

Table 4 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the possibility of successful data transmission in each time slot using the Q-value \(Q_{t+1}^i ( {a_{k,t}^i })\). The state is not represented. The action \({\mathbf{a}}_{ \mathbf{t}}^{\mathbf{i}} \) is to select time slot(s) for data transmission. The reward \(r_{t+1}^i ( {a_{k,t}^i })\) indicates whether the transmission in time slot \(k\) was successful or unsuccessful. The Q-function (1) is rewritten as \(Q_{t+1}^i ( {a_{k,t}^i })\leftarrow (1-\alpha )Q_t^i ( {a_{k,t}^i })+\alpha \cdot r_{t+1}^i ( {a_{k,t}^i })\) since the state is not represented; time slots with higher Q-values indicate a higher possibility of successful transmission, and so these slots are selected for transmission. A similar RL model has also been applied by Mihaylov et al. [30].

Table 4 RL model for Chu’s slot assignment scheme [7]
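
A hedged sketch of this stateless, myopic update is shown below; the +1/-1 rewards and the selection of the highest-valued slots follow the description above, while the function names and learning rate are illustrative.

```python
def update_slot_q(q_slots, slot, success, alpha=0.3):
    """Stateless myopic update Q(a_k) <- (1 - alpha) Q(a_k) + alpha * r,
    where r = +1 for a successful transmission in slot k and -1 otherwise."""
    reward = 1.0 if success else -1.0
    q_slots[slot] = (1 - alpha) * q_slots[slot] + alpha * reward

def best_slots(q_slots, num_needed):
    """Select the time slots with the highest Q-values for transmission."""
    return sorted(range(len(q_slots)), key=lambda k: q_slots[k],
                  reverse=True)[:num_needed]
```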

3.1.2 Forster’s intra-cluster routing scheme

Forster and Murphy [14] propose an intra-cluster routing scheme for clustered networks A(3) using the myopic RL model (see Table 3), and it has been shown to increase throughput P(1), as well as to reduce energy consumption P(3). The purpose is to enable a member node to select a next-hop neighbor node that provides a route with a lower number of hops and higher residual energy towards the clusterhead in a multi-hop cluster. The proposed scheme helps to achieve a balanced energy consumption among member nodes in a cluster in order to prolong network lifetime.

Table 5 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the cost of a route leading to the clusterhead node using the Q-value \(Q_{t+1}^i ( {a_t^i })\). The state is not represented. The action \(a_t^i \) represents the selection of a next-hop node \(j\) to forward packets to the clusterhead. The reward \(r_{t+1}^i ( {a_t^i })\) represents the link cost to the next-hop node.

Table 5 RL model for Forster’s intra-cluster routing scheme [14]

3.2 Algorithm 2: RL model with continuous space representation

RL suffers from the curse of dimensionality: the accuracy of the state and action space representations increases with smaller space partitions, but so does the number of entries to learn and store. However, memory capacity is limited at each sensor node, and so an RL model with a continuous action space is applied to address this. An example of the Q-values of the actions for a particular state is shown in Fig. 3, in which the interval (0,1) is partitioned into a set of discrete actions with their corresponding Q-values. In this RL model, a continuous action is computed as the average of two adjacent discrete actions weighted by their respective Q-values [31].

Fig. 3: The interval (0,1) is partitioned into discrete actions

Table 6 presents the RL algorithm with continuous action space representation. In step (a), with reference to Fig. 3, the continuous action \(a_t^i \) is chosen based on state \(s_t^i \) by averaging the discrete actions \(a_{k,t}^i \) and \(a_{k+1,t}^i \) weighted by their respective Q-values \(Q_t^i ( {s_t^i ,a_{k,t}^i })\) and \(Q_t^i ( {s_t^i ,a_{k+1,t}^i })\). Also, in step (d), the updates of the Q-values \(Q_t^i ( {s_t^i ,a_{k,t}^i })\) and \(Q_t^i ( {s_t^i ,a_{k+1,t}^i })\) are weighted by the linear distance between \(a_{k,t}^i \), \(a_{k+1,t}^i \) and \(a_t^i \).

Table 6 RL algorithm with continuous action space representation
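
The sketch below illustrates step (a), i.e. how a continuous action in (0,1) could be interpolated from two adjacent discrete actions weighted by their Q-values; choosing the adjacent pair with the largest combined Q-value is a plausible reading of [31], not a verbatim reproduction.

```python
def continuous_action(discrete_actions, q_values):
    """Return a continuous action in (0, 1) as the Q-weighted average of
    two adjacent discrete actions (assumes at least two discrete actions)."""
    # Pick the adjacent pair with the largest combined Q-value.
    best_k = max(range(len(discrete_actions) - 1),
                 key=lambda k: q_values[k] + q_values[k + 1])
    a_k, a_k1 = discrete_actions[best_k], discrete_actions[best_k + 1]
    w_k, w_k1 = q_values[best_k], q_values[best_k + 1]
    total = w_k + w_k1
    if total <= 0:
        return (a_k + a_k1) / 2.0      # fall back to the unweighted midpoint
    return (w_k * a_k + w_k1 * a_k1) / total
```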

3.2.1 Niu’s sleep-wake scheduling scheme for MAC protocol

Niu and Deng [31] propose a sleep-wake scheduling scheme A(1.1) for MAC protocol using the RL model with continuous action space representation (see Table 6), and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to select the probability of transmission during the data transmission stage in a time-slotted MAC protocol.

Table 7 shows the RL model for the scheme, and it is embedded in each sensor node to keep track of the probability of transmission during the data transmission stage using Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\). The state represents the changing trend of the number of packets in the buffer queue. The continuous action \(a_t^i \) is to select the probability of transmission during the data transmission stage. The reward \(r_{t+1}^i ( {a_t^i })\) depends on energy consumption levels and the number of packets in the buffer queue (and hence packet latency), and so the scheme aims to achieve a balanced performance in energy consumption and packet latency.

Table 7 RL model for Niu’s sleep-wake scheduling scheme [31]

3.3 Algorithm 3: RL model with directed exploration

Traditionally, an agent adopts an undirected exploration approach (e.g. \(\varepsilon \)-greedy), in which the agent explores actions in a random manner during exploration. An enhanced approach called directed exploration enables an agent to explore actions in a guided manner using domain-specific knowledge (e.g. rewards) or rules in order to improve the convergence rate to the optimal action [21]. For instance, in [21], the exploration probability is adjusted according to two conditions with regard to the variations of the rewards caused by uncertainties in the operating environment. Firstly, the learning speed increases with the variation of the rewards. Secondly, an agent exploits at all times until there are variations in the rewards, which trigger the exploration procedure.

3.3.1 Alberola’s sleep-wake scheduling scheme for MAC protocol

Alberola and Pesch [21] propose a sleep-wake scheduling scheme A(1.1) for MAC protocol using the RL model with directed exploration, and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to achieve a balanced tradeoff between the sleeping time and the waking or active time (or duty cycle) in a time-slotted MAC protocol in order to reduce energy consumption.

Table 8 shows the RL model for the scheme, which is embedded in a centralized node that collects data from sensor nodes in a single hop to determine the optimal active time duration using the Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\). The centralized node collects network statistics during the active time duration to estimate the incoming traffic level from neighboring nodes, and subsequently learns the optimal duty cycle. The state is not represented. The action \(a_t^i \) is to select the number of time slots for sleeping within a time frame. A higher \(a_t^i \) value indicates a lower duty cycle and hence lower energy consumption. Denote the idle time duration by \(t_{IL}^i \), and the number of packets in the buffer queue and its threshold by \(q_t^i \) and \(q_{Th} \), respectively. The cost is \(r_{t+1}^i ( {a_t^i })=0\) if the active duration is fully utilized with idle time \(t_{IL}^i =0\) and there is no buffer overflow (or \(q_t^i <q_{Th} )\). The cost is \(r_{t+1}^i ( {a_t^i })=-1\) if there is buffer overflow, which indicates that the active duration is too short or insufficient. The cost is \(r_{t+1}^i ( {a_t^i })=-t_{IL}^i \) if the active time duration is too long, resulting in energy wastage. The myopic RL algorithm in Table 3 (see Sect. 3.1) is applied to update the Q-values.

Table 8 RL model for Alberola’s sleep-wake scheduling scheme [21]

In [21], the RL model with directed exploration specifies a set of strategies, namely random, round-robin and greedy strategies. The random strategy is applied during exploration, and it is as follows:

$$\begin{aligned} \pi _{1,i} ( a)=\left\{ {{\begin{array}{l} {\text{ rand }_{a<a_{t-1}^i } (a), \quad \mathrm{if}\ r_t^i ( {a_t^i })>r_{t-1}^i ( {a_{t-1}^i })} \\ {\text{ rand }_{a\ge a_{t-1}^i } (a),\quad \text{ otherwise }} \\ \end{array} }} \right. \end{aligned}$$

In this strategy, \(\text{ rand }_{a<a_{t-1}^i } (a)\) chooses a lower number of time slots for sleeping \(a_t^i \) in a random manner in order to reduce the inactive time duration if the reward has increased at time \(t\), or \(r_t^i ( {a_t^i })>r_{t-1}^i ( {a_{t-1}^i })\), which indicates that the incoming traffic level has increased.

The round-robin strategy is applied during exploitation whenever there are extreme cases in which the centralized node either has buffer overflow or is always in idle listening \(t_{IL}^i =1\), and the rule is as follows:

$$\begin{aligned} \pi _{2,i} ( a)=\left\{ {{\begin{array}{ll} {a_{t-1}^i -1,} &{} \quad {\text{ if }\;q_t^i >q_{Th} } \\ {a_{t-1}^i +1,} &{} \quad {\text{ if }\;t_{IL}^i =1} \\ \end{array} }} \right. \end{aligned}$$

In this strategy, \(a_{t-1}^i -1\) reduces the number of sleeping time slots by \(1\) if \(q_t^i >q_{Th} \), while \(a_{t-1}^i +1\) increases the number of sleeping time slots by \(1\) if \(t_{IL}^i =1\). The exploration probability is increased by a step of \(\varepsilon _{max} /10\) to increase exploration.

The greedy strategy is applied during exploitation in cases other than the two aforementioned extreme cases.

$$\begin{aligned} \pi _{3,i} ( a)=\left\{ {{\begin{array}{ll} {\text{ argmax }_{a<a_{t-1}^i } Q_t^i (a),} &{} \quad {\text{ if }\;r_t^i ( {a_t^i })>r_{t-1}^i ( {a_{t-1}^i })} \\ {\text{ argmax }_{a\ge a_{t-1}^i } Q_t^i (a),} &{} \quad {\text{ otherwise }}\\ \end{array} }} \right. \end{aligned}$$

In this strategy, \(\text{ argmax }_{a<a_{t-1}^i } Q_t^i (a)\) chooses a lower number of time slots for sleeping \(a_t^i \) with the maximum Q-value \(Q_t^i (a_t^i )\) in order to reduce the sleeping time duration if the reward has increased at time \(t\), or \(r_t^i ( {a_t^i })>r_{t-1}^i ( {a_{t-1}^i })\). The learning rate is increased by a step of \(\alpha _{max} /10\) to speed up learning; however, the learning rate and exploration probability are decreased whenever the current and previous actions are identical, or \(a_t^i =a_{t-1}^i \), in order to avoid oscillations in action selection.
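
The three strategies can be combined into a single action-selection routine, sketched below; the dispatch logic, the handling of the action-set boundaries and the encoding of the idle fraction are assumptions based on the description above.

```python
import random

def select_sleep_slots(q, prev_action, prev_reward, reward,
                       queue_len, queue_threshold, idle_fraction, explore):
    """Directed exploration for the sleep-wake scheduler.

    q maps 'number of sleeping time slots' -> Q-value."""
    actions = sorted(q)
    # Round-robin strategy: handle the two extreme cases first.
    if queue_len > queue_threshold:
        return max(actions[0], prev_action - 1)    # buffer overflow: sleep less
    if idle_fraction >= 1.0:
        return min(actions[-1], prev_action + 1)   # always idle: sleep more
    # Otherwise split the action set around the previous action, as in
    # the random and greedy strategies.
    lower = [a for a in actions if a < prev_action] or actions
    upper = [a for a in actions if a >= prev_action] or actions
    candidates = lower if reward > prev_reward else upper
    if explore:
        return random.choice(candidates)           # random strategy
    return max(candidates, key=q.get)              # greedy strategy
```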

3.4 Algorithm 4: cooperative RL model

Traditionally, an agent makes decisions on action selection, which may be locally optimal, in an independent manner without communicating with neighboring agents. In order to make decisions on globally optimal action selection, the cooperative RL approach enables agents to observe the local operating environment, exchange information (e.g. states and Q-values) among themselves, and subsequently select local actions as part of the optimal joint action for network-wide performance enhancement. Hence, each local action can affect and can be affected by other agents. This cooperative approach is suitable for schemes that require collaborative efforts in a shared wireless medium. For instance, in routing, nodes along a route, as well as their respective neighboring nodes, must collaborate to reduce interference for end-to-end QoS enhancement. Section 3.4.1 presents cooperative RL algorithms. Section 3.4.2 presents application schemes that apply the cooperative RL algorithms.

3.4.1 Cooperative RL algorithms

This section presents two cooperative RL algorithms applied to WSNs.

Liang’s cooperative function [13] has been applied in cooperative communication scheme A(2); and the discussion of this algorithm is based on this application. Denote the current-hop cooperative nodes and next-hop nodes as \(H_n \) and \(H_{n+1} \) respectively, which can be viewed as the current and neighboring groups of agents. Node \(i\in H_n \) forwards a packet to node \(j\in H_{n+1} \), and keeps track of its Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })&\leftarrow (1-\alpha )Q_t^i ( {s_t^i ,a_t^i })\nonumber \\&+\,\alpha \bigl [r_{t+1}^i ( {s_{t+1}^i })+\gamma w( {i,j})\max \nolimits _{a_j \in H_{n+1} } ( {Q_t^j ( {s_t^j ,a_j })}) \nonumber \\&+\,\gamma {\sum }_{i^{'}\in H_n ,i^{'}\ne i} w( {i,i^{'}}) \max \nolimits _{a_{i^{'}} \in H_{n\backslash i} } ( {Q_t^{i^{'}} ( {s_t^{i^{'}} ,a_{i^{'}} })})\bigr ]\qquad \end{aligned}$$
(10)

where \(w( {i,j})\) represents the weight of node \(i\) on node \(j\)’s Q-value, in which a higher \(w( {i,j})\) indicates greater effect; while \(w( {i,i^{'}})\) represents the weight of node \(i\) on the Q-value of cooperative node \(i^{'}\in H_n \). Note that, with respect to Eq. (10), the third term depends on the maximum Q-value of node \(j\in H_{n+1} \), and the fourth term depends on the maximum Q-values of all nodes in \(H_n \) except node \(i\) itself. Hence, nodes exchange information (i.e. Q-values) with neighboring nodes, and maximize their own and their respective neighboring nodes’ Q-values in order to maximize the global Q-value. At the next time instant, the node \(i\in H_n \) with the highest Q-value is selected as the forwarding node, while the rest of the nodes \(i^{'}\in H_n \backslash i\) become cooperative nodes.
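
A sketch of Eq. (10) is given below; the data structures holding the exchanged maximum Q-values and the weights are assumptions about how a node might store the information received from its neighbors.

```python
def liang_update(q_i, s, a, reward, w_ij, next_hop_max_q,
                 coop_max_qs, coop_weights, alpha=0.5, gamma=0.8):
    """Cooperative Q-update of Eq. (10) for node i in hop H_n.

    w_ij           : weight w(i, j) of the chosen next-hop node j
    next_hop_max_q : max_a Q^j(s^j, a) reported by node j in H_{n+1}
    coop_max_qs    : {i_prime: max_a Q^{i_prime}(s^{i_prime}, a)} for the
                     cooperative nodes in H_n other than i
    coop_weights   : {i_prime: w(i, i_prime)}
    """
    coop_term = sum(coop_weights[n] * v for n, v in coop_max_qs.items())
    target = reward + gamma * (w_ij * next_hop_max_q + coop_term)
    q_i[(s, a)] = (1 - alpha) * q_i.get((s, a), 0.0) + alpha * target
```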

The distributed value function (DVF) approach has been applied in task scheduling A(6) [18]. Node \(i\) calculates and exchanges its local value function \(V^{i}( {s_{t}^{i} })\) (see Eq. (2)) with its neighbor nodes \(j\in J\), and keeps track of its Q-value \(Q_{t+1}^{i} ( {s_{t}^{i} ,a_{t}^{i} })\) as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })\leftarrow (1-\alpha )Q_t^i ( {s_t^i ,a_t^i })+\alpha \left[ r_{t+1}^i ( {s_{t+1}^i })+\gamma \mathop \sum \limits _{j\in J} w( {i,j})V^j( {s_{t+1}^i })\right] \nonumber \\ \end{aligned}$$
(11)

where \(w( {i,j})\) represents the weight of node \(i\) on neighbor node \(j\in J\)’s Q-value. For instance, in [17], the weights for all neighbor node \(j\)’s Q-values are equal, specifically \(w( {i,j})=1/\vert J\vert \).
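
Similarly, a minimal sketch of the DVF update of Eq. (11) follows, assuming the neighbors' value functions have already been received through message exchange and that equal weights \(w( {i,j})=1/\vert J\vert \) are used as in [17].

```python
def dvf_update(q_i, s, a, reward, neighbor_values, alpha=0.5, gamma=0.8):
    """Distributed value function update of Eq. (11).

    neighbor_values : {j: V^j(s_next)} exchanged with neighbor nodes J."""
    if neighbor_values:
        w = 1.0 / len(neighbor_values)          # equal weights w(i, j)
        future = sum(w * v for v in neighbor_values.values())
    else:
        future = 0.0
    q_i[(s, a)] = ((1 - alpha) * q_i.get((s, a), 0.0)
                   + alpha * (reward + gamma * future))
```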

3.4.2 Application schemes with cooperative RL algorithms

This section presents three schemes that apply the cooperative RL model.

Liang’s cooperative communication scheme

Liang et al. [13] propose a cooperative communication scheme A(2) using Liang’s cooperative function (see Sect. 3.4.1.1), and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2). The purpose is to select a forwarding node \(j\in H_{n+1} \) for data transmission from node \(i\) in order to minimize packet loss, as shown in Fig. 4, where \(K=\vert H_n \vert \) is the number of cooperative nodes at the current hop \(H_n \). Generally speaking, the cooperative nodes that form the set \(K\) can hear the two-way routing messages (i.e. Route Request, RREQ, and Route Reply, RREP) between nodes \(i\) and \(j\) [32]. A packet may pass through many sets of forwarding and cooperative nodes as it traverses the network from a sensor node to a sink node.

Fig. 4: Node \(i\) selects node \(j\in H_{n+1} \) as the forwarding node for data transmission; nodes in \(H_n \), where \(K=\vert H_n \vert \), become cooperative nodes at the current hop \(H_n \)

Table 9 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the progress of a packet, in terms of physical distance towards the sink node over time, using the Q-value \(Q_{t+1}^i ( {a_{k,t}^i })\). Specifically, it is the progress being made from the current hop \(H_n \) to the next hop \(H_{n+1} \). The state represents the hop of a packet with respect to the current hop \(H_n \) in which node \(i\) resides. The action \({\mathbf{a}}_{\mathbf{t}}^{\mathbf{i}} \) is to select a forwarding node to forward the data packet. All nodes in the current hop \(H_n \) receive positive rewards and update their Q-values accordingly when they hear further transmission from \(H_{n+1} \) to \(H_{n+2} \), which indicates a successful transmission from \(H_n \) to \(H_{n+1} \); otherwise the nodes receive negative rewards. Higher positive rewards indicate higher transmission quality, in which the packet has made greater progress in terms of physical distance towards the sink node over time, while larger negative rewards indicate a longer timeout duration wasted as a result of unsuccessful packet transmission. Both positive and negative rewards are normalized values. All nodes in the current hop \(H_n \) update with the same positive (or negative) reward because all of them have made the correct (or incorrect) action in the selection of a forwarding node. Node \(i\in H_n \) forwards a packet to node \(j\in H_{n+1} \), and keeps track of its Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) using Eq. (10).

Table 9 RL model for Liang’s cooperative communication scheme [13]

A similar RL model and algorithm have been applied to achieve other performance enhancements in cooperative communications [32–34] and routing [26]. The RL model is redefined in [32, 33]. As an example, with respect to node \(i\) in a cooperative communication scheme A(2) [33], the state represents QoS satisfaction/violation levels; the action represents the selection of a cooperative node; and the reward represents the improvement in packet delivery rate and packet latency brought about by indirect transmission. As another example, with respect to node \(i\) in a routing scheme A(3) [26], the state represents a set of neighboring nodes and the QoS requirements of packets; the action represents the selection of a next-hop node; and the reward represents the inverse of packet latency, so that higher rewards indicate shorter single-hop delay.

Next, Sect. 3.4.2.2 presents the extension [32] of the RL model presented in this section [13, 33] to reduce energy consumption and enhance QoS.

Liang’s cooperative communication scheme with QoS enhancement

Liang et al. [32] propose a cooperative communication scheme A(2) using Liang’s cooperative function (see Sect. 3.4.1.1) to provide QoS enhancement, and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to select a forwarding node \(j\in H_{n+1} \) for data transmission from node \(i\), and to adjust its transmission power level, in order to reduce energy consumption and enhance QoS.

Generally speaking, to provide QoS enhancement, Liang et al. [32] use the same RL algorithm as in Sect. 3.4.2.1; the difference is that the RL model is redefined. Table 10 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the contribution of indirect transmission compared to direct transmission, based on successful packet transmission, packet latency and energy efficiency, using the Q-value \(Q_{t+1}^i ( {a_{k,t}^i })\). The state \({\mathbf{s}}_{\mathbf{t}}^{\mathbf{i}} \) represents a set of potential forwarding and cooperative nodes in the current hop \(H_n \), and a set of data flows at node \(i\). The action \({\mathbf{a}}_{\mathbf{t}}^{\mathbf{i}} \) is to select whether to forward packets of a data flow and the transmission power level; hence, a node may transmit packets using an appropriate transmission power level adaptively according to the channel condition. The reward represents the improvement in packet delivery rate, packet latency and energy efficiency brought about by indirect transmission; this information is indicated in the acknowledgement (ACK) packets sent by the forwarding node \(j\in H_{n+1} \) to node \(i\).

Table 10 RL model for Liang’s cooperative communication scheme with QoS enhancement [32]

Tham’s sensing coverage scheme

Seah et al. [4], Tham and Renaud [17], and Renaud and Tham [35] propose a sensing coverage scheme A(5) using DVF (see Sect. 3.4.1.2), and it has been shown to increase sensing coverage P(4), as well as to reduce energy consumption P(3). The purpose is to select a sensing coverage level for each sensor node \(i\) in monitoring tasks in order to minimize the overlapping of sensing coverage with neighbor nodes \(j\in J\).

Table 11 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the coverage of each grid point in the surrounding area of the sensor node using the Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) [4]. The state \(s_t^i \) represents the coverage of a grid point. The action \(a_t^i \) is to select whether to hibernate (inactive) or sense (active). The reward \(r_{t+1}^i ( {s_t^i ,a_t^i })\) represents the ratio of the gain received for each grid point being covered to the state of the respective grid point. Higher positive rewards indicate higher gain and that the grid point is covered by a single sensor node only, and hence lower energy consumption. Each sensor node keeps track of the Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) using Eq. (11).

Table 11 RL model for Tham’s sensing coverage scheme [4]

A similar RL model and algorithm have been applied to achieve the same purpose in [35], in which there are two levels of sensing range. Generally speaking, the RL model is redefined in [35], as shown in Table 12. The state \(s_t^i \) represents the coverage of a grid point. The action \(a_t^i \) is to select whether to hibernate (inactive), or to sense (active) with either a short-range or a long-range coverage. The reward \(r_{t+1}^i ( {s_t^i })\) represents the gain received for each grid point being covered minus the cost associated with energy consumption. Long-range sensing incurs a higher cost. Higher positive rewards indicate higher gain and that the grid point is covered by a smaller number of sensor nodes, and hence lower energy consumption.

Table 12 RL model for Tham’s sensing coverage scheme [35]

3.5 Algorithm 5: model-based RL model

Convergence to an optimal policy can be achieved after some learning time; however, due to the dynamicity of the operating environment, the convergence rate is unpredictable. While a higher learning rate \(\alpha \) may increase the convergence speed, the Q-value may fluctuate, particularly when the dynamicity of the operating environment is high, because the Q-value then depends more on its recent estimates than on its previous experience [37]. A model-based RL model has been applied to increase the convergence speed. This approach estimates the state transition probability matrix \({\mathbf{T}}\), which forms the model and represents the operating environment, and subsequently updates the Q-values using \({\mathbf{T}}\). The state transition probability matrix \({\mathbf{T}}\) is comprised of the probabilities of transitioning from one state to another in a single time instant.

3.5.1 Hu’s routing scheme

Hu and Fei [36] propose a routing scheme A(3) using a model-based RL model, and it has been shown to increase throughput P(1), as well as to reduce energy consumption P(3). The purpose is to select a next-hop neighbor node with higher residual energy, which subsequently sends packets towards the sink node.

Table 13 shows the RL model for the scheme. Note that the model is a representation for a particular packet; in other words, the RL model is embedded in each packet. The state \(s_t^i \) represents the node in which a particular packet resides (or node \(i)\). The action \(a_t^i \) represents the selection of a next-hop neighbor node \(j\). The reward \(r( {s_t^i ,\ a_t^i })\) represents various types of energy, including the transmission and residual energies, incurred in forwarding a packet to node \(a_t^i =j\). Taking the residual energy into account avoids highly utilized routes (or hot spots) in order to achieve a balanced energy distribution among routes.

Table 13 RL model for Hu’s routing scheme [36]

Node \(i\)’s Q-function, which indicates the appropriateness of transmitting a packet from node \(i\) to node \(a_t^i =j\), is updated at time\(\ t+1\) as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,\ j})=\;r\;( {s_t^i ,\ a_t^i })+\gamma \left( {P_{s_t^i s_t^i }^{a_t^i } \mathop {\max \nolimits }\limits _{k\in a_t^i } Q_t^i ( {s_t^i ,k})+P_{s_t^i s_t^j }^{a_t^i } \mathop {\max \nolimits }\limits _{k\in a_t^j } Q_t^j ( {s_t^j ,k})}\right) \quad \quad \end{aligned}$$
(12)

where \(P_{s_t^i s_t^i }^{a_t^i } \) is the transition probability of an unsuccessful transmission from \(s_t^i \) (or node \(i)\) after taking action \(a_t^i \), while \(P_{s_t^i s_t^j }^{a_t^i } \) is the transition probability of a successful transmission from \(s_t^i \) to \(s_t^j \) (or node \(j)\) after taking action \(a_t^i \). The transition probabilities \(P_{s_t^i s_t^i }^{a_t^i } \) and \(P_{s_t^i s_t^j }^{a_t^i } \), which form the system model, are estimated using the historical data of each link’s successful and unsuccessful transmission rates based on the outgoing traffic of the next-hop neighbor nodes.
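
A sketch of the model-based update of Eq. (12) is shown below; the per-link success counters used to estimate the transition probabilities are a simplified assumption about how the historical data could be maintained.

```python
def estimate_success_prob(successes, attempts):
    """Estimate the link success probability from historical counts."""
    return successes / attempts if attempts else 0.5   # uninformed prior

def model_based_update(q_i, q_j, node_j, reward, successes, attempts,
                       gamma=0.8):
    """Model-based Q-update of Eq. (12) for forwarding a packet from
    node i to next-hop node j.

    q_i, q_j : {next_hop: Q-value} tables of node i and node j."""
    p_success = estimate_success_prob(successes, attempts)
    p_fail = 1.0 - p_success
    best_i = max(q_i.values()) if q_i else 0.0   # packet stays at node i
    best_j = max(q_j.values()) if q_j else 0.0   # packet reaches node j
    q_i[node_j] = reward + gamma * (p_fail * best_i + p_success * best_j)
```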

3.6 Algorithm 6: hierarchical RL model

The hierarchical RL model segregates the entire system into upper and lower levels, and applies two separate RL approaches simultaneously, one in each level, to achieve globally and locally optimal actions, respectively. Hence, a large and complex problem, such as routing [25], can be segregated into smaller problems, which can be solved simultaneously. In this way, the hierarchical RL model helps to reduce the state space, and so it improves scalability and the convergence rate.

3.6.1 Hu’s hierarchical routing scheme

Hu and Fei [25] propose a hierarchical routing scheme A(3) for clustered networks using a hierarchical RL model, and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to select a next-hop node, which subsequently sends packets towards the sink node, with higher successful transmission rate.

Traditionally, in flat (or non-clustered) networks, Q-values keep track of a route cost comprised of multiple hops; however, any update of a link cost must be propagated along a route, which may be slow in large networks. Hence, in [25], a clustered network using hierarchical routing A(3) is proposed. There are two types of routing schemes, namely intra- and inter-cluster routing. The clusterhead performs inter-cluster routing to search for the best route to the sink node. The member nodes perform intra-cluster routing to search for the best route to a gateway node in the cluster. Gateway nodes are the member nodes located at the fringe of a cluster, and since they can hear from neighboring clusters, they provide inter-cluster communications. With respect to the hierarchical RL model, the upper and lower levels represent the inter-cluster and intra-cluster routing schemes, respectively. Hence, the upper and lower layers are comprised of clusterheads and member nodes, respectively. The upper layer supervises the lower layer so that member nodes search for the best route to the gateway node selected by the upper layer, while the lower layer provides evaluation feedback to the upper layer on the selection of the gateway node. This approach improves the sensitivity of a route towards changes in the network topology because an update now traverses a smaller number of hops and is confined to a particular cluster.

Table 14 shows the RL model for the scheme. Note that the model is a representation for a particular packet. The state \(s_t^i \) represents the node in which a particular packet resides (or node \(i)\). The action \(a_t^i \) represents the selection of a next-hop neighbor node \(j\); a series of actions leads the packet to the gateway nodes, rather than to the sink node as in traditional networks. Note that there are two levels of Q-learning, for the intra- and inter-cluster routing schemes respectively; Table 14 shows intra-cluster routing being implemented within each cluster. The reward representation is \(r_{t+1}^i ( {a_t^i })=-1\) and \(r_{t+1}^i ( {a_t^i })=-c\) for successful and unsuccessful packet transmissions, respectively. The negative reward indicates the resource consumption and network performance deterioration (i.e. energy consumption and packet latency) for each packet transmission.

Table 14 RL model for Hu’s hierarchical routing scheme [25]
Table 15 Performance enhancement of RL-based schemes in WSNs

4 Performance enhancements

RL has been shown to achieve the following performance enhancements as shown in Table 15:

P.1 Higher throughput. Higher throughput indicates a higher packet delivery rate, a higher successful packet transmission rate, a lower packet loss rate and a lower number of packet retransmissions.

P.2 Lower end-to-end delay/packet latency. Lower end-to-end delay (in multi-hop transmissions) and lower packet latency (in single-hop transmissions) indicate a lower number of packets in the buffer queue.

P.3 Lower energy consumption. Lower energy consumption increases network lifetime. Since each sensor node operates on battery power, energy consumption is a common performance metric. Other performance enhancements, such as higher throughput and lower end-to-end delay, may also indicate lower energy consumption due to a lower packet loss rate and a lower number of packet retransmissions.

P.4 Higher sensing coverage. Higher sensing coverage indicates that a larger part of an area is covered by at least a single sensor node, and so there is a higher rate of detection of the events of interest. A sensing coverage scheme A(5) must minimize the overlapping of sensing coverage with neighboring nodes to reduce energy consumption.

P.5 Higher route discovery rate. A higher route discovery rate indicates a higher success rate in finding a favorable route from a source node to a sink node. In [28], a favorable route must be free from malicious nodes, which drop packets received from previous hops.

P.6 Higher in-contact time. Higher in-contact time indicates a greater possibility of a sensor node discovering the presence of a mobile data collector node, as well as a longer duration for data transmission, in a sleep-wake scheduling scheme A(1.1) [12].

5 Open issues

This section discusses open issues that can be pursued in this research area.

5.1 Convergence rate and energy consumption

The convergence rate is affected by how much an agent learns, as well as by the condition of the operating environment. Generally speaking, the convergence rate increases with a longer waking or active time (or duty cycle), as well as with lower levels of dynamicity and uncertainty in the operating environment. A shorter waking time reduces energy consumption and the convergence rate, and it is suitable for operating environments with lower levels of dynamicity and uncertainty. On the other hand, a longer waking time increases the convergence rate and energy consumption, and it is suitable for operating environments with higher levels of dynamicity and uncertainty. Future research could be pursued to adjust the waking duration in order to achieve a balanced tradeoff between convergence rate and energy consumption with respect to the condition of the environment.

5.2 Enhancement on the scalability of RL

The Q-table is a two-dimensional lookup table comprised of \(\vert \text{ S }\vert \times \vert \text{ A }\vert \) entries. There are two important considerations with respect to scalability. Firstly, the number of entries grows with the product of the numbers of states and actions, and grows exponentially with the number of state and action variables; however, there may be limited memory capacity at each sensor node. Secondly, a large number of state-action pairs requires more exploration to discover most of the Q-values, which reduces the convergence rate to the optimal action and increases the energy consumption associated with learning (see Sect. 5.3). In addition to the state-action pairs, each sensor node must estimate and keep track of the state transition probability matrix \({\mathbf{T}}\) in the model-based RL model (see Sect. 3.5). Future research could be pursued to reduce the number of state-action pairs, as well as the number of state pairs in the model-based RL model, without jeopardizing the accuracy of the state and action space representations, in order to improve scalability and to reduce energy consumption.

5.3 Minimization of learning cost

The condition of the operating environment affects the usefulness and the recency of the knowledge (or Q-values). An agent must make the right decision on whether or not to learn. Generally speaking, learning should only take place if two conditions are fulfilled. Firstly, the new knowledge remains useful for a long enough period of time; that is, the operating environment remains consistent for a long enough period of time (or with sufficiently low levels of dynamicity and uncertainty). Secondly, the rewards (e.g. throughput) received must be greater than the learning costs (e.g. energy consumption and processing power). Future research could be pursued to investigate mechanisms for making the right decision on whether or not to learn, as well as for reducing the learning cost.

5.4 Enhancement on the security aspect of RL

The requirement for the sensor and sink nodes to observe and learn from the operating environment has inevitably opened new security vulnerabilities. Malicious nodes may manipulate the operating environment so that the nodes learn incorrect information. The malicious nodes may have learning capabilities and launch attacks in a cooperative manner. Consequently, the sensor nodes may converge to suboptimal actions, take longer to converge, or even fail to converge to optimal actions. This vulnerability may be particularly pronounced in the cooperative RL model (see Sect. 3.4), in which nodes that receive manipulated information may in effect become malicious themselves, since they exchange information (e.g. states and Q-values) among themselves. Consequently, manipulated nodes may make sub-optimal local actions as part of the joint action. Future research could be pursued to investigate mechanisms to address the security vulnerabilities associated with the application of RL in WSNs.

5.5 Reduction of message exchange overhead

The requirement for the sensor and sink nodes to exchange information (e.g. states and Q-values) among themselves, in order to learn from each other and from the operating environment, has inevitably increased the amount of control message exchange and energy consumption. Message exchange is essential for learning in the cooperative and hierarchical RL models (see Sects. 3.4 and 3.6, respectively); however, reducing the message exchange frequency may decrease the convergence rate to an optimal joint action. Future research could be pursued to investigate mechanisms to reduce the message exchange frequency without jeopardizing network-wide performance.

6 Conclusions

Reinforcement learning (RL) has been applied in wireless sensor networks (WSNs) to provide network performance enhancement in a wide range of schemes. To apply RL, several representations, including the state, action, and the delayed and discounted rewards, must be defined. Additionally, several features, including the Q-function, as well as the exploration and exploitation mechanism, must be defined. In the context of WSNs, this article has presented an extensive review of the enhancements of these representations and features. Most importantly, this article has presented an extensive review of a wide range of RL models and enhanced RL algorithms in the context of WSNs. The enhanced algorithms provide insights into how various schemes in WSNs can be approached using RL. The performance enhancements achieved by the traditional and enhanced RL algorithms in WSNs have also been presented. Certainly, there is a great deal of future work in the use of RL, and we have raised open issues in this article.