1 Introduction

A wireless sensor network (WSN) [1] is comprised of a large number of sensor nodes and sink nodes that monitor events or environmental parameters, such as temperature and humidity, in a collaborative manner. The sensor nodes collect data and send it to destination sink nodes over single or multiple hops, while the sink nodes process the data in order to provide meaningful information to end users. WSNs have numerous potential applications, including in the medical field, disaster recovery and wildlife monitoring.

Generally speaking, each sensor node operates on battery power. Two main factors affect energy consumption. Firstly, the state of the transceiver: energy consumption is high during transmission, reception and idle listening (or overhearing), and low during sleep. Secondly, events other than successful packet transmission, including collisions, retransmissions and control packet transmissions, incur additional energy consumption. Enhancing energy efficiency to prolong network lifetime without jeopardizing network performance has attracted considerable research attention, and has been part of the objective of most schemes in WSNs because sensor nodes may be deployed in hard-to-reach areas.

In recent years, there has been increasing interest in applying an artificial intelligence approach called reinforcement learning (RL) [2] to various schemes in WSNs in order to improve network performance. The RL approach adopts an unsupervised and online learning technique. Through unsupervised learning, no external teacher or critic is required to oversee the learning process, and so a decision maker (or agent) must make its own efforts to acquire knowledge about the operating environment. Through online learning, an agent acquires knowledge on the fly while carrying out its normal operation, and so empirical data or experimental results from the laboratory are not required. A wide range of schemes can be represented using RL models, and subsequently various aspects of network performance can be improved using RL algorithms.

Although extensive research has been carried out on a wide range of schemes in WSNs, no single study adequately covers the distinctive RL models and algorithms that have been applied, and so this is the focus of this article. The rest of this article is organized as follows. Sections 1.1 and 1.4 present an overview of RL and the application schemes of WSNs, respectively. In the context of WSNs, Sect. 2 presents various components, features and enhancements of RL, while Sect. 3 presents various RL models and algorithms. Section 4 presents the performance enhancements brought about by RL in various schemes. Section 5 presents open issues, and finally Sect. 6 presents conclusions.

1.1 Overview of reinforcement learning

This section presents an overview of the RL model and Q-learning.

1.2 RL model

Figure 1 presents a simplified version of an RL model. The purpose of RL is to estimate the long-term reward of each state-action pair through trial-and-error interactions with the operating environment. There are three main representations. Firstly, the state represents the decision-making factors (or the operating environment) under consideration being observed by an agent. Examples are residual energy and the number of packets in the buffer queue. Secondly, the action represents the choice selected by the agent, which may change or affect the state and reward. Examples are selecting a transmission power and selecting a next-hop node for packet transmission. Thirdly, the reward represents the gains or losses in network performance for taking an action in a particular state at the previous time instant. Examples are throughput and energy consumption level.

Fig. 1: A simplified RL model

At any time instant, an agent observes the state and reward from its operating environment, learns the long-term reward of each state-action pair, and decides on and carries out an appropriate action on the environment so that the state and reward, which are the consequences of the action, improve in the next time instant. The agent interacts with the operating environment in a trial-and-error manner, and so, given a particular state, the agent learns to carry out the optimal action as time progresses in order to improve the next state and reward.

The RL model in Fig. 1 can be embedded in each sensor node [3], or applied to the surrounding area of a sensor node [4]. For instance, each sensor node keeps track of the reward with regard to each neighboring sensor node in [5], and for each grid point in its surrounding operating environment in [4]. As an example of the application of RL in WSNs, consider routing (see Sect. 1.4), where RL is used to learn the optimal route. The state represents a destination (or sink) node, the action represents the selection of a next-hop node to forward packets, and the reward represents the progress, in terms of physical distance, towards the sink node. Maximizing the reward reduces the distance to the sink node, which enhances network performance.

There are two main advantages of RL. Firstly, it models overall network performance, which covers most of the factors affecting the performance, rather than modeling each factor individually, and so this simplifies the design. Secondly, it learns on the fly during normal operation, and so it does not require prior knowledge of the operating environment. For instance, a sleep-wake scheduler aims to reduce energy consumption through sleeping for the right duration at the right time; the traffic loads at neighboring nodes are pertinent to determining this duration, although this information may not be known to the scheduler.

1.3 Q-learning

Q-learning [6] is a popular technique in RL. Denoting the decision epochs by \(t\in T=\{1,2,\ldots \}\), each agent \(i\) updates the Q-function \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) of a particular state-action pair at time \(t\) as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })\leftarrow ( {1-\alpha })Q_t^i ( {s_t^i ,a_t^i })+\alpha \left[ {r_{t+1}^i ( {s_{t+1}^i })+\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})} \right] \end{aligned}$$
(1)

where \(s_t^i \in S\) is state, \(a_t^i \in A\) is action, \(r_{t+1}^i ( {s_{t+1}^i })\in R\) is delayed reward, \(0\le \gamma \le 1\) is discount factor, and \(0\le \alpha \le 1\) is learning rate. Note that, the delayed reward \(r_{t+1}^i ( {s_{t+1}^i })\) for action selection at time \(t\) is dependent on the state at time \(t+1\) and so it is received at time \(t+1\). Also note that, higher \(\gamma \) value causes greater dependency on the discounted future reward \(\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\) rather than the delayed reward \(r_{t+1}^i (s_{t+1}^i )\); while higher \(\alpha \) value causes greater dependency on the delayed reward \(r_{t+1}^i ( {s_{t+1}^i })\) and the discounted future reward \(\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\) rather than the Q-value \(Q_t^i ( {s_t^i ,a_t^i })\) at time \(t\).

An agent \(i\) observes state \(s_t^i \) from the operating environment and chooses an action \(a_t^i \) at decision epoch \(t\). The state \(s_t^i \) changes to \(s_{t+1}^i \) at decision epoch \(t+1\). Subsequently, the agent receives delayed reward \(r_{t+1}^i ( {s_{t+1}^i })\) and updates the Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) using Eq. (1). The Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) is updated using the maximum discounted future reward \(\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\), assuming the agent takes the optimal action in future states at times \(t+1,t+2,\ldots \). As time progresses, the agent receives a sequence of rewards, which contribute to the convergence of the Q-values to the long-term rewards. The agent chooses an optimal action through maximizing the value function \(V^\pi (s_t^i )\) as shown below:

$$\begin{aligned} V^\pi ( {s_t^i })=\max \nolimits _{a\in A} (Q_t^i ( {s_t^i ,a})) \end{aligned}$$
(2)

Hence, agent \(i\)’s policy is as follows:

$$\begin{aligned} \pi _i ( {s_t^i })=\text{ argmax }_{a\in A} (Q_t^i ( {s_t^i ,a})) \end{aligned}$$
(3)

In some cases, a negative reward represents a cost, which must be minimized, and so \(V^\pi ( {s_t^i })=\min _{a\in A} (Q_t^i ( {s_t^i ,a}))\) and \(\pi _i ( {s_t^i })=\text{ argmin }_{a\in A} (Q_t^i ( {s_t^i ,a}))\). Note that choosing the optimal action using Eq. (3) at all times does not update the Q-values of the other actions, which may cause the agent to converge to locally optimal solutions. Hence, there are two methods for action selection. Exploitation chooses the best-known optimal action for performance enhancement, while exploration chooses the other actions once in a while to update their Q-values so that better actions may be discovered.
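
To make Eqs. (1)–(3) concrete, the following is a minimal tabular Q-learning sketch in Python with \(\varepsilon \)-greedy action selection. The class name, the default parameter values and the dictionary-based Q-table are illustrative assumptions rather than part of any cited scheme.

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Minimal tabular Q-learning agent implementing Eqs. (1)-(3)."""

    def __init__(self, actions, alpha=0.5, gamma=0.8, epsilon=0.1):
        self.actions = list(actions)   # action set A
        self.alpha = alpha             # learning rate
        self.gamma = gamma             # discount factor
        self.epsilon = epsilon         # exploration probability
        self.q = defaultdict(float)    # Q(s, a), initialized to zero

    def update(self, s, a, reward, s_next):
        """Eq. (1): blend the old Q-value with the delayed reward plus
        the discounted maximum future reward."""
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] = ((1 - self.alpha) * self.q[(s, a)]
                          + self.alpha * (reward + self.gamma * best_next))

    def value(self, s):
        """Eq. (2): value function V(s) = max_a Q(s, a)."""
        return max(self.q[(s, a)] for a in self.actions)

    def choose_action(self, s):
        """Eq. (3) with epsilon-greedy selection: explore with a small
        probability, otherwise exploit the best-known action."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)                  # exploration
        return max(self.actions, key=lambda a: self.q[(s, a)])  # exploitation
```

For a routing agent, for example, the actions could be the neighbor identifiers and the state the destination sink node, with the table updated after each acknowledged transmission.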

1.4 Application schemes of wireless sensor networks

Reinforcement learning is a versatile approach to many of the problems and open issues associated with the dynamicity and uncertainty of the operating environment. RL has been applied in various schemes in WSNs as follows:

A.1 Medium access control (MAC). MAC protocols coordinate channel access among multiple nodes in single-hop transmissions to reduce collisions. The two main functions are the sleep-wake scheduler [7–9] and the transceiver selector [10], as follows:

A.1.1 Sleep-wake scheduler arranges the transmission, reception, idle and sleeping time durations. During idle mode, sensor nodes listen for potential packet transmissions, and the energy consumption is almost identical to that of the receive mode. To reduce energy consumption, a sleep-wake scheduler schedules the sleeping and waking (i.e. transmission, reception and idle) time durations. There are two main purposes in sleep-wake scheduling. Firstly, a longer waking time duration (or higher duty cycle) increases bandwidth availability, leading to higher throughput and lower packet latency; however, it also increases energy consumption. The waking time duration may increase with network traffic load [8, 9] or Quality of Service (QoS) requirements [11]. RL has been applied to minimize collisions and energy consumption in slot assignment [7], as well as to estimate traffic arrivals from neighboring nodes in order to adjust the sleeping and waking time durations [8, 9]. Secondly, a mobile data collector node moves within an area to collect sensing outcomes from static sensor nodes [12]. RL has been applied in each sensor node to learn the waking time duration based on the arrival pattern of the mobile data collector node [12] in order to increase the in-contact time with the mobile data collector node while reducing energy consumption.

A.1.2 Transceiver selector selects either a long-range or a short-range radio for data and control packet transmissions. The long-range (short-range) radio uses higher (lower) transmission power. To reduce energy consumption, a transceiver selector switches between the transceivers based on physical range (e.g. whenever a mobile node moves from one effective transmission range to another) and channel conditions (e.g. fading, interference, shadowing and multi-path effects) [10].

A.2 Cooperative communications select cooperative nodes to forward packets towards sink nodes in order to ameliorate the effects of deteriorating channel conditions and changes in network topology. For instance, with reference to Fig. 2, suppose the direct transmission \(i\rightarrow j\) (from node \(i\) to forwarding node \(j)\) is unsuccessful. Any packet retransmission through the direct transmission \(i\rightarrow j\) may still be unsuccessful if the channel experiences deep fading for a long period of time. Since node \(k\) overhears the packet, cooperative communication enables the indirect transmission \(i\rightarrow k\rightarrow j\). Since there may be a number of potential cooperative nodes \(k\in K\), RL has been applied at node \(i\) to select a cooperative node [13], thereby providing spatial and time diversity gains.

Fig. 2: Cooperative communications

A.3 Routing enables a sensor node to search for the best route to a sink node in clustered [14] and non-clustered networks [5]. Generally speaking, clustering segregates the entire network into groups, each consisting of a clusterhead and member nodes. The clusterhead collects, processes and aggregates the sensing outcomes received from member nodes, and subsequently sends them to the sink node over single or multiple hops. RL has been applied in each sensor node to learn the best route to the sink node.

A.4 Rate control adjusts the packet transmission rate of a source node, and hence the congestion level of intermediate nodes, along a route [15, 16].

A.5 Sensing coverage is a WSN application that maximizes the physical sensing coverage of an area so that any event of interest is accurately detected by at least a single sensor node. Sensing coverage can be applied in surveillance and monitoring tasks (e.g. intruder and fire detection). To reduce energy consumption, RL has been applied in each sensor node to minimize the overlapping of sensing coverage [17].

A.6 Task scheduling schedules and carries out the right task at different time instants. For instance, in [18], RL has been applied in each sensor node to learn the usefulness of each task (i.e. sensing, transmitting, receiving, aggregating data, and sleeping) at different time instants in order to reduce energy consumption.

2 Reinforcement learning: components, features and enhancements

This section presents the traditional and enhanced components and features of RL in the context of WSNs. For each component and feature, we show the traditional approach and subsequently the alternative or enhanced approaches.

2.1 State

Traditionally, each state is comprised of a single type of information. For instance, each state \(s_t^i \in S=\left\{ {1,2,\ldots ,K} \right\} \) represents the number of packets in the buffer queue [8]. The state representation can be enhanced in two ways. Firstly, the state may not be represented at all because there is only a single state, and this is called stateless [7]. Secondly, each state may be comprised of distinctive substates. For instance, state \({\mathbf{s}}_{\mathbf{t}}^{\mathbf{i}} =( {s_{x,t}^i ,s_{y,t}^i })\in S\), where \(s_{x,t}^i \in S_x \) and \(s_{y,t}^i \in S_y \) represent a set of potential neighboring nodes and data flows, respectively [32].

The state representation can be further enhanced by reducing the state space based on the Hamming distance. In general, a larger state space increases the memory requirement and reduces the convergence rate to optimal actions, since an agent must explore more state-action pairs. The number of states can be reduced based on the Hamming distance [12]. An agent calculates the weighted Hamming distance between two states, specifically \(H( {s_1 -s_2 })=W_1 \cdot \left| {V_1 ( {s_1 })-V_1 ( {s_2 })} \right| +W_2 \cdot \left| {V_2 ( {s_1 })-V_2 ( {s_2 })} \right| +\cdots +W_N \cdot \vert V_N ( {s_1 })-V_N ( {s_2 })\vert \), where the weight \(W_n \) represents the significance of the corresponding term \(\vert V_n ( {s_1 })-V_n ( {s_2 })\vert \) in differentiating the two states. Both states \(s_1 \) and \(s_2 \) share a single entry in the Q-table if their Hamming distance is less than a threshold, i.e. \(H( {s_1 -s_2 })<H_T \).
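
As an illustration, the sketch below shows how Hamming-distance-based state aggregation could be implemented; the state encoding (a tuple of variable values), the weights and the threshold are hypothetical.

```python
def weighted_hamming(s1, s2, weights):
    """Weighted Hamming distance H(s1 - s2), where each state is a
    tuple of variable values (V_1, ..., V_N)."""
    return sum(w * abs(v1 - v2) for w, v1, v2 in zip(weights, s1, s2))

def q_table_key(state, representatives, weights, threshold):
    """Map a state onto an existing Q-table entry if it lies within the
    Hamming-distance threshold of a known representative state;
    otherwise register it as a new representative. This bounds the
    number of Q-table entries."""
    for rep in representatives:
        if weighted_hamming(state, rep, weights) < threshold:
            return rep                 # share the existing Q-table entry
    representatives.append(state)      # new representative state
    return state
```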

2.2 Action

Traditionally, each action represents a single action out of a set of possible actions. For instance, in a routing scheme A(3) [5, 14], each action \(a_t^i \in A=\{1,2,\ldots ,K\}\) represents a next-hop node for packet transmission, while \(A\) represents the set of all neighbor nodes. The action representation can be enhanced in two ways. Firstly, each action may be represented by subactions. For instance, in [7, 13], action \({\mathbf{a}}_{\mathbf{t}}^{\mathbf{i}} =( {a_{1,t}^i ,a_{2,t}^i ,\ldots ,a_{K,t}^i })\in A_1 \times A_2 \times \cdots \times A_K \), where \(a_{k,t}^i \in A_k =\{0,1\}\). Secondly, the subactions may be of distinctive types. For instance, in [32], action \({\mathbf{a}}_{\mathbf{t}}^{\mathbf{i}} =( {a_{x,t}^i ,a_{y,t}^i })\in A\), where \(a_{x,t}^i \in A_x =\{0,1\}\) and \(a_{y,t}^i \in A_y =\{a_{y,1,t}^i ,a_{y,2,t}^i ,\ldots ,a_{y,K,t}^i \}\).

2.3 Delayed reward

Traditionally, each delayed reward represents the performance enhancement achieved by a state-action pair, and a single reward computation approach is applicable to all state-action pairs. For example, in a sleep-wake scheduler A(1.1) [8], the delayed reward \(r_{t+1}^i ( {a_{t+1}^i })\) is the ratio of the effective transmission and reception time durations to the waking time duration. Additionally, the delayed reward can be a constant value, such as \(r_{t+1}^i ( {a_{t+1}^i })=1\) or \(-1\) to indicate successful and unsuccessful transmissions, respectively [7]. The delayed reward can be further enhanced in the context of WSNs as described next.

2.3.1 Distinctive reward functions

Different reward functions can be used to compute rewards under distinctive network conditions [5, 10].

As an example, in a transceiver selector A(1.2) [10], the action is to select a transceiver and its transmission power level for packet transmissions. The cost (or negative reward) for each packet transmission depends on the amount of energy consumption, and it is a function of the number of retransmissions, the MAC delays (i.e. channel sensing and backoff), the transmission and reception power levels, as well as the packet size. A higher cost indicates higher energy consumption. However, there is one condition under which the reward computation is different. Specifically, if transmissions are unsuccessful even though the highest transmission power has been used, a zero reward value is assigned in order to prevent the agent from exploring other actions with lower transmission power levels until the transmissions are successful at the highest transmission power level.
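
A sketch of this conditional cost computation is shown below; the energy model and its arguments are simplified assumptions and do not reproduce the exact formulation in [10].

```python
def transmission_cost(success, tx_power, max_tx_power,
                      retransmissions, mac_delay, rx_power, packet_size):
    """Return the negative reward (cost) for one packet transmission.

    If the transmission fails even at the highest transmission power,
    a zero reward is returned so that the agent is not pushed towards
    exploring lower power levels that would also fail."""
    if not success and tx_power >= max_tx_power:
        return 0.0
    # Otherwise the cost reflects the energy consumed: retransmissions,
    # MAC delays (channel sensing and backoff), transmit/receive power
    # levels and packet size all contribute.
    energy = (retransmissions + 1) * (tx_power + rx_power) * packet_size
    energy += mac_delay * rx_power
    return -energy
```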

2.3.2 Average delayed reward function

Traditionally, the delayed reward is an instantaneous value. The application of an average delayed reward has been shown to improve the overall system performance [19], and it has been applied in [16, 20].

As an example, in [16], the average delayed reward is as follows:

$$\begin{aligned} r_{a,t+1}^i ( {s_{t+1}^i })&\leftarrow r_{a,t}^i ( {s_{t+1}^i })+\alpha _r \Big [ r_t^i ( {s_{t+1}^i })-r_{a,t}^i ( {s_{t+1}^i })\nonumber \\&\qquad +\max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})-\max \nolimits _{a\in A} Q_t^i ( {s_t^i ,a}) \Big ] \end{aligned}$$
(4)

where \(r_{a,t}^i ( {s_{t+1}^i })\) represents the average delayed reward, and \(\alpha _r \) represents the learning rate of the average delayed reward computation. The Q-function (1) is rewritten to incorporate the average delayed reward as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })&\leftarrow ( {1-\alpha })Q_t^i ( {s_t^i ,a_t^i })\nonumber \\&\qquad +\,\alpha \left[ {r_{t+1}^i ( {s_{t+1}^i })-r_{a,t}^i ( {s_{t+1}^i })+\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})} \right] \end{aligned}$$
(5)

In [16], the average delayed reward approach is applied in congestion avoidance A(4) to adjust the packet transmission rate of a source node in order to adjust the congestion level. The state represents the number of packets in the buffer queue; the action selects a next-hop node and a packet transmission rate; and the delayed reward is a function of energy efficiency and packet loss rate.

As another example, in [20], the average delayed reward is as follows:

$$\begin{aligned} r_{a,t+1}^i ( {s_{t+1}^i })\leftarrow (1-\alpha _r )\cdot r_{a,t}^i ( {s_{t+1}^i })+\alpha _r \cdot r_t^i ( {s_{t+1}^i }) \end{aligned}$$
(6)

The Q-function (1) is rewritten to incorporate the average delayed reward as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })&\leftarrow ( {1-\alpha })Q_t^i ( {s_t^i ,a_t^i })\nonumber \\&\qquad +\,\alpha \left[ {r_{t+1}^i ( {s_{t+1}^i })-\gamma (r_{a,t}^i ( {s_{t+1}^i })+\max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a}))} \right] \quad \quad \end{aligned}$$
(7)

In [20], the average delayed reward approach is applied in a sleep-wake scheduler A(1.1) to adjust the waking time duration (or duty cycle), transmission power and modulation levels in order to reduce energy consumption. The state represents the channel gain and the number of packets in the buffer queue; the action selects the waking time duration, as well as the transmission power and modulation levels; and the reward is the ratio of the number of received and transmitted packets to the energy consumption and the processing cost in the buffer queue.
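
The two variants above differ only in how the running average enters the Q-update. A hedged sketch of the simpler exponential-average form of Eqs. (6) and (7) follows; the dictionary-based tables and default parameter values are illustrative.

```python
def update_with_average_reward(q, r_avg, s, a, reward, s_next, actions,
                               alpha=0.5, alpha_r=0.1, gamma=0.8):
    """One update step combining Eq. (6) and Eq. (7).

    q     : dict mapping (state, action) -> Q-value
    r_avg : dict mapping state -> running average delayed reward
    """
    # Eq. (6): exponential moving average of the delayed reward.
    r_avg[s_next] = (1 - alpha_r) * r_avg.get(s_next, 0.0) + alpha_r * reward

    # Eq. (7): Q-update that discounts the average reward together with
    # the maximum future Q-value.
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    q[(s, a)] = ((1 - alpha) * q.get((s, a), 0.0)
                 + alpha * (reward - gamma * (r_avg[s_next] + best_next)))
```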

2.4 Discounted reward

Traditionally, the discounted reward has been applied to indicate the dependency of the Q-function on future rewards. As an example, in a routing scheme A(3) [5], the delayed reward represents the link cost from a node to a next-hop node, while the discounted reward \(\gamma \max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\) represents the route cost from the next-hop node to a sink node, which may be multiple hops away. The discounted reward may be omitted with \(\gamma =0\) to show the lack of dependency on future rewards; this is generally called the myopic approach, which enables an agent to adapt to instantaneous changes in the operating environment [21], and further discussion on this approach is presented in Sect. 3.1. The discounted reward can be further enhanced using an average discounted reward function. Generally speaking, the future reward may be uncertain in some cases, and so an agent may be uncertain about its action selection. In [22], the average discounted Q-value is computed over all possible actions, and hence \(\max \nolimits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\) in Eq. (1) is replaced by \(\mathop \sum _{a\in A} [P(a)\times Q_t^i ( {s_{t+1}^i ,a})]/\mathop \sum _{a\in A} P(a)\). Note that, if all possible actions are taken into account, then \(\mathop \sum _{a\in A} P( a)=1\).

2.5 Q-function

The traditional Q-function (see Eq. (1)) can be further enhanced.

2.5.1 Q-value initialization

Generally speaking, the Q-values are initialized to a certain value (e.g. a zero value) so that all possible actions are given a fair chance during exploration. However, they can be initialized with different values to speed up the convergence rate. For instance, in a cooperative communication scheme A(2) [13], the Q-values are initialized based on the distance between a node \(i\) and its next-hop node \(j\), in which higher Q-values indicate more favorable nodes for making progress, in terms of physical distance, towards a sink node in order to reduce end-to-end delay.

2.5.2 Reward equivalent Q-function

In [23, 24], the learning rate and discount factor are set to \(\alpha =1\) and \(\gamma =0\), so the Q-function equals the delayed reward, \(Q_{t+1}^i ( {a_t^i })=r_{t+1}^i ( {a_t^i })\), and it is applied to speed up the learning process in a routing scheme A(3). In [23], a node \(i\) selects its next-hop node \(a_t^i \) and updates its Q-value \(Q_{t+1}^i ( {a_t^i })=r_{t+1}^i ( {a_t^i })=c_{a_t^i } +\text{ min }_a Q_t^{a_t^i } ( a)\), where \(c_{a_t^i } \) represents the link cost between node \(i\) and its next-hop node \(a_t^i \), and \(\text{ min }_a Q_t^{a_t^i } ( a)\) indicates that node \(a_t^i \) chooses its next-hop node with the minimum Q-value. Note that nodes must exchange Q-values, which indicate the route cost to a destination sink node, among themselves.
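
The reward-equivalent update in [23] is essentially a distributed, Bellman–Ford-style propagation of route costs; the sketch below illustrates the idea, with the dictionaries holding link costs and the neighbors' advertised costs as hypothetical placeholders.

```python
def update_route_costs(q_i, advertised_cost, link_cost):
    """Reward-equivalent Q-function (alpha = 1, gamma = 0):
    Q_i(a) = c_a + min_b Q_a(b) for each candidate next hop a.

    q_i             : dict mapping next-hop id -> node i's route cost
    advertised_cost : dict mapping next-hop id -> that neighbor's minimum
                      Q-value towards the sink (exchanged via messages)
    link_cost       : dict mapping next-hop id -> link cost c_a
    """
    for nxt in advertised_cost:
        q_i[nxt] = link_cost[nxt] + advertised_cost[nxt]
    # The next hop actually chosen is the one with the minimum route cost.
    return min(q_i, key=q_i.get)
```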

2.6 Exploration and exploitation

Traditionally, there are two popular approaches to achieving a balanced trade-off between exploration and exploitation, namely softmax and \(\varepsilon \)-greedy [2], which have been applied in [4, 5, 8, 14, 23, 25]. For instance, in [8], an agent chooses exploration actions with a small probability \(\varepsilon \) and exploitation actions with probability \(1-\varepsilon \). In [21], during initial exploration, an agent explores all the available actions in a round-robin manner in order to discover the Q-values of all actions. The exploration and exploitation mechanism can be further enhanced through adjusting the exploration probability.

The exploration probability may be adjusted based on the levels of uncertainty and dynamicity of the operating environment due to nodal mobility and varying channel conditions. As an example, in [26], using the \(\varepsilon \)-greedy approach, node \(i\) adjusts its exploration probability \(\varepsilon _t^i =n_{a+d,T}^i /n_T^i \), where \(n_{a+d,T}^i \) represents the number of nodes that appear and disappear in node \(i\)’s transmission range within a time window \(T\), and \(n_T^i \) represents the number of node \(i\)’s neighboring nodes. As another example, in [12], node \(i\) adjusts its exploration probability \(\varepsilon _t^i =\varepsilon _{min} +\text{ max }[0,( {\varepsilon _{max} -\varepsilon _{min} })\times ( {e_{max} -e})/e_{max} ]\), where \(e\) represents the number of events of interest; a lower \(e\) value increases the exploration probability \(\varepsilon _t^i \).

The exploration probability may also be adjusted based on action selection. In [24], using the \(\varepsilon \)-greedy approach, node \(i\) adjusts its exploration probability as follows:

$$\begin{aligned} \varepsilon _{t+1}^i =\left\{ {{\begin{array}{ll} {\varepsilon _t^i +\varepsilon _{step} ,} &{}\quad {if\ a_t^i \ne a_{t-1}^i } \\ {\varepsilon _t^i -\varepsilon _{step} ,} &{}\quad {\text{ otherwise }} \\ \end{array} }} \right. \end{aligned}$$
(8)

Note that \(\varepsilon _{t+1}^i =\varepsilon _t^i +\varepsilon _{step} \) helps to discover the optimal actions when the operating environment becomes unstable (i.e. when consecutive actions change, or \(a_t^i \ne a_{t-1}^i )\), while \(\varepsilon _{t+1}^i =\varepsilon _t^i -\varepsilon _{step} \) helps to settle on the optimal actions when the operating environment becomes stable.
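
A minimal sketch of the adaptive exploration rule of Eq. (8) is given below; the step size and the clamping bounds that keep \(\varepsilon \) a valid probability are assumptions.

```python
def adjust_epsilon(epsilon, action, prev_action,
                   step=0.05, eps_min=0.01, eps_max=0.5):
    """Eq. (8): raise the exploration probability when consecutive
    actions differ (unstable environment), lower it otherwise."""
    if action != prev_action:
        epsilon += step    # environment unstable: explore more
    else:
        epsilon -= step    # environment stable: exploit more
    return max(eps_min, min(eps_max, epsilon))
```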

3 Reinforcement learning: algorithms

The traditional RL approach (see Sect. 1.1) has been applied in various schemes to provide performance enhancement in WSNs, as shown in Table 1.

A major contribution of this section is the discussion on a number of new additions and enhancements to the traditional RL algorithms, which have been applied to various schemes in WSNs. A summary of the new RL models and algorithms is shown in Table 2. The following subsections describe the model and algorithm, including the purpose(s) of the scheme(s), followed by its associated RL model (i.e. state, action and reward representations), and finally the algorithm.

Table 1 RL models with direct application of the traditional RL approach for various schemes in WSNs
Table 2 Summary of RL models and algorithms for various schemes in WSNs

3.1 Algorithm 1: myopic RL model with \(\gamma =0\)

The myopic RL model sets the discount factor to zero, i.e. \(\gamma =0\), so that there is no dependency on future rewards; it has been applied in MAC protocols A(1) [7, 29, 30] and clustering A(3) [14]. Table 3 presents the RL algorithm for the myopic RL model.

Table 3 Myopic RL algorithm with discount factor \(\gamma =0\)

3.1.1 Chu’s slot assignment scheme for MAC protocol

Chu et al. [7] propose a slot assignment scheme for MAC protocol A(1) using the myopic RL model (see Table 3), and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to select a time slot within a time frame for data transmission in order to minimize collisions in a time-slotted MAC protocol.

Table 4 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the possibility of successful data transmission in each time slot using the Q-value \(Q_{t+1}^i ( {a_{k,t}^i })\). The state is not represented. The action \({\mathbf{a}}_{ \mathbf{t}}^{\mathbf{i}} \) is to select time slot(s) for data transmission. The reward \(r_{t+1}^i ( {a_{k,t}^i })\) indicates whether the transmission in time slot \(k\) was successful or unsuccessful. The Q-function (1) is rewritten as \(Q_{t+1}^i ( {a_{k,t}^i })\leftarrow (1-\alpha )Q_t^i ( {a_{k,t}^i })+\alpha \cdot r_{t+1}^i ( {a_{k,t}^i })\) since the state is not represented; time slots with higher Q-values indicate a higher possibility of successful transmission, and so these slots are selected for transmission. A similar RL model has also been applied by Mihaylov et al. [30].

Table 4 RL model for Chu’s slot assignment scheme [7]
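
A hedged sketch of this stateless, myopic update is shown below; the +1/-1 rewards and the selection of the highest-valued slots follow the description above, while the function names and learning rate are illustrative.

```python
def update_slot_q(q_slots, slot, success, alpha=0.3):
    """Stateless myopic update Q(a_k) <- (1 - alpha) Q(a_k) + alpha * r,
    where r = +1 for a successful transmission in slot k and -1 otherwise."""
    reward = 1.0 if success else -1.0
    q_slots[slot] = (1 - alpha) * q_slots[slot] + alpha * reward

def best_slots(q_slots, num_needed):
    """Select the time slots with the highest Q-values for transmission."""
    return sorted(range(len(q_slots)), key=lambda k: q_slots[k],
                  reverse=True)[:num_needed]
```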

3.1.2 Forster’s intra-cluster routing scheme

Forster and Murphy [14] propose an intra-cluster routing scheme for clustered networks A(3) using the myopic RL model (see Table 3), and it has been shown to increase throughput P(1), as well as to reduce energy consumption P(3). The purpose is to enable a member node to select a next-hop neighbor node that provides a route with a lower number of hops and higher residual energy towards the clusterhead in a multi-hop cluster. The proposed scheme helps to achieve a balanced energy consumption among member nodes in a cluster in order to prolong network lifetime.

Table 5 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the cost of a route leading to the clusterhead node using the Q-value \(Q_{t+1}^i ( {a_t^i })\). The state is not represented. The action \(a_t^i \) represents the selection of a next-hop node \(j\) to forward packets to the clusterhead. The reward \(r_{t+1}^i ( {a_t^i })\) represents the link cost to the next-hop node.

Table 5 RL model for Forster’s intra-cluster routing scheme [14]

3.2 Algorithm 2: RL model with continuous space representation

RL suffers from the curse of dimensionality: the accuracy of the state and action space representations increases with smaller space partitions, but so does the number of entries to learn and store. However, memory capacity is limited at each sensor node, and so an RL model with a continuous action space is applied to address this. An example of the Q-values of the actions for a particular state is shown in Fig. 3, in which the interval (0,1) is partitioned into a set of discrete actions with their corresponding Q-values. In this RL model, a continuous action is computed as the average of two adjacent discrete actions weighted by their respective Q-values [31].

Fig. 3: The interval (0,1) is partitioned into discrete actions

Table 6 presents the RL algorithm with continuous action space representation. In step (a), with reference to Fig. 3, the continuous action \(a_t^i \) is chosen based on state \(s_t^i \) by averaging the discrete actions \(a_{k,t}^i \) and \(a_{k+1,t}^i \) weighted by their respective Q-values \(Q_t^i ( {s_t^i ,a_{k,t}^i })\) and \(Q_t^i ( {s_t^i ,a_{k+1,t}^i })\). Also, in step (d), the updates of the Q-values \(Q_t^i ( {s_t^i ,a_{k,t}^i })\) and \(Q_t^i ( {s_t^i ,a_{k+1,t}^i })\) are weighted by the linear distance between \(a_{k,t}^i \), \(a_{k+1,t}^i \) and \(a_t^i \).

Table 6 RL algorithm with continuous action space representation
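
The sketch below illustrates step (a), i.e. how a continuous action in (0,1) could be interpolated from two adjacent discrete actions weighted by their Q-values; choosing the adjacent pair with the largest combined Q-value is a plausible reading of [31], not a verbatim reproduction.

```python
def continuous_action(discrete_actions, q_values):
    """Return a continuous action in (0, 1) as the Q-weighted average of
    two adjacent discrete actions (assumes at least two discrete actions)."""
    # Pick the adjacent pair with the largest combined Q-value.
    best_k = max(range(len(discrete_actions) - 1),
                 key=lambda k: q_values[k] + q_values[k + 1])
    a_k, a_k1 = discrete_actions[best_k], discrete_actions[best_k + 1]
    w_k, w_k1 = q_values[best_k], q_values[best_k + 1]
    total = w_k + w_k1
    if total <= 0:
        return (a_k + a_k1) / 2.0      # fall back to the unweighted midpoint
    return (w_k * a_k + w_k1 * a_k1) / total
```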

3.2.1 Niu’s sleep-wake scheduling scheme for MAC protocol

Niu and Deng [31] propose a sleep-wake scheduling scheme A(1.1) for MAC protocol using the RL model with continuous action space representation (see Table 6), and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to select the probability of transmission during the data transmission stage in a time-slotted MAC protocol.

Table 7 shows the RL model for the scheme, and it is embedded in each sensor node to keep track of the probability of transmission during the data transmission stage using Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\). The state represents the changing trend of the number of packets in the buffer queue. The continuous action \(a_t^i \) is to select the probability of transmission during the data transmission stage. The reward \(r_{t+1}^i ( {a_t^i })\) depends on energy consumption levels and the number of packets in the buffer queue (and hence packet latency), and so the scheme aims to achieve a balanced performance in energy consumption and packet latency.

Table 7 RL model for Niu’s sleep-wake scheduling scheme [31]

3.3 Algorithm 3: RL model with directed exploration

Traditionally, an agent adopts an undirected exploration approach (e.g. \(\varepsilon \)-greedy), in which the agent explores actions in a random manner during exploration. An enhanced approach called directed exploration enables an agent to explore actions in a guided manner using domain-specific knowledge (e.g. rewards) or rules in order to improve the convergence rate to the optimal action [21]. For instance, in [21], the exploration probability is adjusted according to two conditions with regard to the variations of the rewards caused by uncertainties in the operating environment. Firstly, the learning speed increases with the variation of the rewards. Secondly, an agent exploits at all times until there are variations in the rewards, which trigger the exploration procedure.

3.3.1 Alberola’s sleep-wake scheduling scheme for MAC protocol

Alberola and Pesch [21] propose a sleep-wake scheduling scheme A(1.1) for MAC protocol using the RL model with directed exploration, and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to achieve a balanced tradeoff between the sleeping time and the waking or active time (or duty cycle) in a time-slotted MAC protocol in order to reduce energy consumption.

Table 8 shows the RL model for the scheme, which is embedded in a centralized node that collects data from sensor nodes in a single hop to determine the optimal active time duration using the Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\). The centralized node collects network statistics during the active time duration to estimate the incoming traffic level from neighboring nodes, and subsequently learns the optimal duty cycle. The state is not represented. The action \(a_t^i \) is to select the number of time slots for sleeping within a time frame. A higher \(a_t^i \) value indicates a lower duty cycle and hence lower energy consumption. Denote the idle time duration by \(t_{IL}^i \), and the number of packets in the buffer queue and its threshold by \(q_t^i \) and \(q_{Th} \), respectively. The cost is \(r_{t+1}^i ( {a_t^i })=0\) if the active duration is fully utilized with idle time \(t_{IL}^i =0\) and there is no buffer overflow (or \(q_t^i <q_{Th} )\). The cost is \(r_{t+1}^i ( {a_t^i })=-1\) if there is buffer overflow, which indicates that the active duration is too short or insufficient. The cost is \(r_{t+1}^i ( {a_t^i })=-t_{IL}^i \) if the active time duration is too long, resulting in energy wastage. The myopic RL algorithm in Table 3 (see Sect. 3.1) is applied to update the Q-values.

Table 8 RL model for Alberola’s sleep-wake scheduling scheme [21]

In [21], the RL model with directed exploration specifies a set of strategies, namely random, round-robin and greedy strategies. The random strategy is applied during exploration, and it is as follows:

$$\begin{aligned} \pi _{1,i} ( a)=\left\{ {{\begin{array}{l} {\text{ rand }_{a<a_{t-1}^i } (a), \quad \mathrm{if}\ r_t^i ( {a_t^i })>r_{t-1}^i ( {a_{t-1}^i })} \\ {\text{ rand }_{a\ge a_{t-1}^i } (a),\quad \text{ otherwise }} \\ \end{array} }} \right. \end{aligned}$$

In this strategy, \(\text{ rand }_{a<a_{t-1}^i } (a)\) chooses a lower number of time slots for sleeping \(a_t^i \) in a random manner in order to reduce the inactive time duration if the reward has increased at time \(t\), or \(r_t^i ( {a_t^i })>r_{t-1}^i ( {a_{t-1}^i })\), which indicates that the incoming traffic level has increased.

The round-robin strategy is applied during exploitation whenever there are extreme cases in which the centralized node either has buffer overflow or is always in idle listening \(t_{IL}^i =1\), and the rule is as follows:

$$\begin{aligned} \pi _{2,i} ( a)=\left\{ {{\begin{array}{ll} {a_{t-1}^i -1,} &{} \quad {\text{ if }\;q_t^i >q_{Th} } \\ {a_{t-1}^i +1,} &{} \quad {\text{ if }\;t_{IL}^i =1} \\ \end{array} }} \right. \end{aligned}$$

In this strategy, \(a_{t-1}^i -1\) reduces the number of sleeping time slots by \(1\) if \(q_t^i >q_{Th} \), while \(a_{t-1}^i +1\) increases the number of sleeping time slots by \(1\) if \(t_{IL}^i =1\). The exploration probability is increased by a step of \(\varepsilon _{max} /10\) to increase exploration.

The greedy strategy is applied during exploitation in cases other than the two aforementioned extreme cases.

$$\begin{aligned} \pi _{3,i} ( a)=\left\{ {{\begin{array}{ll} {\text{ argmax }_{a<a_{t-1}^i } Q_t^i (a),} &{} \quad {\text{ if }\;r_t^i ( {a_t^i })>r_{t-1}^i ( {a_{t-1}^i })} \\ {\text{ argmax }_{a\ge a_{t-1}^i } Q_t^i (a),} &{} \quad {\text{ otherwise }}\\ \end{array} }} \right. \end{aligned}$$

In this strategy, \(\text{ argmax }_{a<a_{t-1}^i } Q_t^i (a)\) chooses a lower number of time slots for sleeping \(a_t^i \) with the maximum Q-value \(Q_t^i (a_t^i )\) in order to reduce the sleeping time duration if the reward has increased at time \(t\), or \(r_t^i ( {a_t^i })>r_{t-1}^i ( {a_{t-1}^i })\). The learning rate is increased by a step of \(\alpha _{max} /10\) to speed up learning; however, the learning rate and exploration probability are decreased whenever the current and previous actions are identical, or \(a_t^i =a_{t-1}^i \), in order to avoid oscillations in action selection.
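
The three strategies can be combined into a single action-selection routine, sketched below; the dispatch logic, the handling of the action-set boundaries and the encoding of the idle fraction are assumptions based on the description above.

```python
import random

def select_sleep_slots(q, prev_action, prev_reward, reward,
                       queue_len, queue_threshold, idle_fraction, explore):
    """Directed exploration for the sleep-wake scheduler.

    q maps 'number of sleeping time slots' -> Q-value."""
    actions = sorted(q)
    # Round-robin strategy: handle the two extreme cases first.
    if queue_len > queue_threshold:
        return max(actions[0], prev_action - 1)    # buffer overflow: sleep less
    if idle_fraction >= 1.0:
        return min(actions[-1], prev_action + 1)   # always idle: sleep more
    # Otherwise split the action set around the previous action, as in
    # the random and greedy strategies.
    lower = [a for a in actions if a < prev_action] or actions
    upper = [a for a in actions if a >= prev_action] or actions
    candidates = lower if reward > prev_reward else upper
    if explore:
        return random.choice(candidates)           # random strategy
    return max(candidates, key=q.get)              # greedy strategy
```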

3.4 Algorithm 4: cooperative RL model

Traditionally, an agent makes decisions on action selection, which may be locally optimal, in an independent manner without communicating with neighboring agents. In order to make decisions on globally optimal action selection, the cooperative RL approach enables agents to observe the local operating environment, exchange information (e.g. states and Q-values) among themselves, and subsequently select local actions as part of the optimal joint action for network-wide performance enhancement. Hence, each local action can affect and can be affected by other agents. This cooperative approach is suitable for schemes that require collaborative efforts in a shared wireless medium. For instance, in routing, nodes along a route, as well as their respective neighboring nodes, must collaborate to reduce interference for end-to-end QoS enhancement. Section 3.4.1 presents cooperative RL algorithms. Section 3.4.2 presents application schemes that apply the cooperative RL algorithms.

3.4.1 Cooperative RL algorithms

This section presents two cooperative RL algorithms applied to WSNs.

Liang’s cooperative function [13] has been applied in cooperative communication scheme A(2); and the discussion of this algorithm is based on this application. Denote the current-hop cooperative nodes and next-hop nodes as \(H_n \) and \(H_{n+1} \) respectively, which can be viewed as the current and neighboring groups of agents. Node \(i\in H_n \) forwards a packet to node \(j\in H_{n+1} \), and keeps track of its Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })&\leftarrow (1-\alpha )Q_t^i ( {s_t^i ,a_t^i })\nonumber \\&+\,\alpha \bigl [r_{t+1}^i ( {s_{t+1}^i })+\gamma w( {i,j})\max \nolimits _{a_j \in H_{n+1} } ( {Q_t^j ( {s_t^j ,a_j })}) \nonumber \\&+\,\gamma {\sum }_{i^{'}\in H_n ,i^{'}\ne i} w( {i,i^{'}}) \max \nolimits _{a_{i^{'}} \in H_{n\backslash i} } ( {Q_t^{i^{'}} ( {s_t^{i^{'}} ,a_{i^{'}} })})\bigr ]\qquad \end{aligned}$$
(10)

where \(w( {i,j})\) represents the weight of node \(i\) on node \(j\)’s Q-value, in which a higher \(w( {i,j})\) indicates greater effect; while \(w( {i,i^{'}})\) represents the weight of node \(i\) on the Q-value of cooperative node \(i^{'}\in H_n \). Note that, with respect to Eq. (10), the third term depends on the maximum Q-value of node \(j\in H_{n+1} \), and the fourth term depends on the maximum Q-values of all nodes in \(H_n \) except node \(i\) itself. Hence, nodes exchange information (i.e. Q-values) with neighboring nodes, and maximize their own and their respective neighboring nodes’ Q-values in order to maximize the global Q-value. At the next time instant, the node \(i\in H_n \) with the highest Q-value is selected as the forwarding node, while the rest of the nodes \(i^{'}\in H_n \backslash i\) become cooperative nodes.
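
A sketch of Eq. (10) is given below; the data structures holding the exchanged maximum Q-values and the weights are assumptions about how a node might store the information received from its neighbors.

```python
def liang_update(q_i, s, a, reward, w_ij, next_hop_max_q,
                 coop_max_qs, coop_weights, alpha=0.5, gamma=0.8):
    """Cooperative Q-update of Eq. (10) for node i in hop H_n.

    w_ij           : weight w(i, j) of the chosen next-hop node j
    next_hop_max_q : max_a Q^j(s^j, a) reported by node j in H_{n+1}
    coop_max_qs    : {i_prime: max_a Q^{i_prime}(s^{i_prime}, a)} for the
                     cooperative nodes in H_n other than i
    coop_weights   : {i_prime: w(i, i_prime)}
    """
    coop_term = sum(coop_weights[n] * v for n, v in coop_max_qs.items())
    target = reward + gamma * (w_ij * next_hop_max_q + coop_term)
    q_i[(s, a)] = (1 - alpha) * q_i.get((s, a), 0.0) + alpha * target
```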

The distributed value function (DVF) approach has been applied in task scheduling A(6) [18]. Node \(i\) calculates and exchanges its local value function \(V^{i}( {s_{t}^{i} })\) (see Eq. (2)) with its neighbor nodes \(j\in J\), and keeps track of its Q-value \(Q_{t+1}^{i} ( {s_{t}^{i} ,a_{t}^{i} })\) as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,a_t^i })\leftarrow (1-\alpha )Q_t^i ( {s_t^i ,a_t^i })+\alpha \left[ r_{t+1}^i ( {s_{t+1}^i })+\gamma \mathop \sum \limits _{j\in J} w( {i,j})V^j( {s_{t+1}^i })\right] \nonumber \\ \end{aligned}$$
(11)

where \(w( {i,j})\) represents the weight of node \(i\) on neighbor node \(j\in J\)’s Q-value. For instance, in [17], the weights for all neighbor node \(j\)’s Q-values are equal, specifically \(w( {i,j})=1/\vert J\vert \).
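
Similarly, a minimal sketch of the DVF update of Eq. (11) follows, assuming the neighbors' value functions have already been received through message exchange and that equal weights \(w( {i,j})=1/\vert J\vert \) are used as in [17].

```python
def dvf_update(q_i, s, a, reward, neighbor_values, alpha=0.5, gamma=0.8):
    """Distributed value function update of Eq. (11).

    neighbor_values : {j: V^j(s_next)} exchanged with neighbor nodes J."""
    if neighbor_values:
        w = 1.0 / len(neighbor_values)          # equal weights w(i, j)
        future = sum(w * v for v in neighbor_values.values())
    else:
        future = 0.0
    q_i[(s, a)] = ((1 - alpha) * q_i.get((s, a), 0.0)
                   + alpha * (reward + gamma * future))
```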

3.4.2 Application schemes with cooperative RL algorithms

This section presents three schemes that apply the cooperative RL model.

Liang’s cooperative communication scheme

Liang et al. [13] propose a cooperative communication scheme A(2) using Liang’s cooperative function (see Sect. 3.4.1.1), and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2). The purpose is to select a forwarding node \(j\in H_{n+1} \) for data transmission from node \(i\) in order to minimize packet loss, as shown in Fig. 4, where \(K=\vert H_n \vert \) is the number of cooperative nodes at the current hop \(H_n \). Generally speaking, the cooperative nodes that form the set \(K\) can hear the two-way routing messages (i.e. Route Request, RREQ, and Route Reply, RREP) between nodes \(i\) and \(j\) [32]. A packet may pass through many sets of forwarding and cooperative nodes as it traverses the network from a sensor node to a sink node.

Fig. 4: Node \(i\) selects node \(j\in H_{n+1} \) as the forwarding node for data transmission; nodes in \(H_n \), where \(K=\vert H_n \vert \), become cooperative nodes at the current hop \(H_n \)

Table 9 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the progress of a packet, in terms of physical distance towards the sink node over time, using the Q-value \(Q_{t+1}^i ( {a_{k,t}^i })\). Specifically, it is the progress being made from the current hop \(H_n \) to the next hop \(H_{n+1} \). The state represents the hop of a packet with respect to the current hop \(H_n \) in which node \(i\) resides. The action \({\mathbf{a}}_{\mathbf{t}}^{\mathbf{i}} \) is to select a forwarding node to forward the data packet. All nodes in the current hop \(H_n \) receive positive rewards and update their Q-values accordingly when they hear further transmission from \(H_{n+1} \) to \(H_{n+2} \), which indicates a successful transmission from \(H_n \) to \(H_{n+1} \); otherwise the nodes receive negative rewards. Higher positive rewards indicate higher transmission quality, in which the packet has made greater progress in terms of physical distance towards the sink node over time, while larger negative rewards indicate a longer timeout duration wasted as a result of unsuccessful packet transmission. Both positive and negative rewards are normalized values. All nodes in the current hop \(H_n \) update with the same positive (or negative) reward because all of them have made the correct (or incorrect) action in the selection of a forwarding node. Node \(i\in H_n \) forwards a packet to node \(j\in H_{n+1} \), and keeps track of its Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) using Eq. (10).

Table 9 RL model for Liang’s cooperative communication scheme [13]

A similar RL model and algorithm have been applied to achieve other performance enhancements in cooperative communications [32–34] and routing [26]. The RL model is redefined in [32, 33]. As an example, with respect to node \(i\) in a cooperative communication scheme A(2) [33], the state represents QoS satisfaction/violation levels; the action represents the selection of a cooperative node; and the reward represents the improvement in packet delivery rate and packet latency brought about by indirect transmission. As another example, with respect to node \(i\) in a routing scheme A(3) [26], the state represents a set of neighboring nodes and the QoS requirements of packets; the action represents the selection of a next-hop node; and the reward represents the inverse of packet latency, so that higher rewards indicate shorter single-hop delay.

Next, Sect. 3.4.2.2 presents the extension [32] of the RL model presented in this section [13, 33] to reduce energy consumption and enhance QoS.

Liang’s cooperative communication scheme with QoS enhancement

Liang et al. [32] propose a cooperative communication scheme A(2) using Liang’s cooperative function (see Sect. 3.4.1.1) to provide QoS enhancement, and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to select a forwarding node \(j\in H_{n+1} \) for data transmission from node \(i\), and to adjust its transmission power level, in order to reduce energy consumption and enhance QoS.

Generally speaking, to provide QoS enhancement, Liang et al. [32] use the same RL algorithm as in Sect. 3.4.2.1; the difference is that the RL model is redefined. Table 10 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the contribution of indirect transmission compared to direct transmission, based on successful packet transmission, packet latency and energy efficiency, using the Q-value \(Q_{t+1}^i ( {a_{k,t}^i })\). The state \({\mathbf{s}}_{\mathbf{t}}^{\mathbf{i}} \) represents a set of potential forwarding and cooperative nodes in the current hop \(H_n \), and a set of data flows at node \(i\). The action \({\mathbf{a}}_{\mathbf{t}}^{\mathbf{i}} \) is to select whether to forward packets of a data flow and the transmission power level; hence, a node may transmit packets using an appropriate transmission power level adaptively according to the channel condition. The reward represents the improvement in packet delivery rate, packet latency and energy efficiency brought about by indirect transmission; this information is indicated in the acknowledgement (ACK) packets sent by the forwarding node \(j\in H_{n+1} \) to node \(i\).

Table 10 RL model for Liang’s cooperative communication scheme with QoS enhancement [32]

Tham’s sensing coverage scheme

Seah et al. [4], Tham and Renaud [17], and Renaud and Tham [35] propose a sensing coverage scheme A(5) using DVF (see Sect. 3.4.1.2), and it has been shown to increase sensing coverage P(4), as well as to reduce energy consumption P(3). The purpose is to select a sensing coverage level for each sensor node \(i\) in monitoring tasks in order to minimize the overlapping of sensing coverage with neighbor nodes \(j\in J\).

Table 11 shows the RL model for the scheme, which is embedded in each sensor node to keep track of the coverage of each grid point in the surrounding area of the sensor node using the Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) [4]. The state \(s_t^i \) represents the coverage of a grid point. The action \(a_t^i \) is to select whether to hibernate (inactive) or sense (active). The reward \(r_{t+1}^i ( {s_t^i ,a_t^i })\) represents the ratio of the gain received for each grid point being covered to the state of the respective grid point. Higher positive rewards indicate higher gain and that the grid point is covered by a single sensor node only, and hence lower energy consumption. Each sensor node keeps track of the Q-value \(Q_{t+1}^i ( {s_t^i ,a_t^i })\) using Eq. (11).

Table 11 RL model for Tham’s sensing coverage scheme [4]

A similar RL model and algorithm have been applied to achieve the same purpose in [35], in which there are two levels of sensing range. Generally speaking, the RL model is redefined in [35], as shown in Table 12. The state \(s_t^i \) represents the coverage of a grid point. The action \(a_t^i \) is to select whether to hibernate (inactive), or to sense (active) with either a short-range or a long-range coverage. The reward \(r_{t+1}^i ( {s_t^i })\) represents the gain received for each grid point being covered minus the cost associated with energy consumption. Long-range sensing incurs a higher cost. Higher positive rewards indicate higher gain and that the grid point is covered by a smaller number of sensor nodes, and hence lower energy consumption.

Table 12 RL model for Tham’s sensing coverage scheme [35]

3.5 Algorithm 5: model-based RL model

Convergence to an optimal policy can be achieved after some learning time; however, due to the dynamicity of the operating environment, the convergence rate is unpredictable. While a higher learning rate \(\alpha \) may increase the convergence speed, the Q-value may fluctuate, particularly when the dynamicity of the operating environment is high, because the Q-value then depends more on its recent estimates than on its previous experience [37]. A model-based RL model has been applied to increase the convergence speed. This approach estimates the state transition probability matrix \({\mathbf{T}}\), which forms the model and represents the operating environment, and subsequently updates the Q-values using \({\mathbf{T}}\). The state transition probability matrix \({\mathbf{T}}\) is comprised of the probabilities of transitioning from one state to another in a single time instant.

3.5.1 Hu’s routing scheme

Hu and Fei [36] propose a routing scheme A(3) using a model-based RL model, and it has been shown to increase throughput P(1), as well as to reduce energy consumption P(3). The purpose is to select a next-hop neighbor node with higher residual energy, which subsequently sends packets towards the sink node.

Table 13 shows the RL model for the scheme. Note that the model is a representation for a particular packet; in other words, the RL model is embedded in each packet. The state \(s_t^i \) represents the node in which a particular packet resides (or node \(i)\). The action \(a_t^i \) represents the selection of a next-hop neighbor node \(j\). The reward \(r( {s_t^i ,\ a_t^i })\) represents various types of energy, including the transmission and residual energies, incurred in forwarding a packet to node \(a_t^i =j\). Taking the residual energy into account avoids highly utilized routes (or hot spots) in order to achieve a balanced energy distribution among routes.

Table 13 RL model for Hu’s routing scheme [36]

Node \(i\)’s Q-function, which indicates the appropriateness of transmitting a packet from node \(i\) to node \(a_t^i =j\), is updated at time\(\ t+1\) as follows:

$$\begin{aligned} Q_{t+1}^i ( {s_t^i ,\ j})=\;r\;( {s_t^i ,\ a_t^i })+\gamma \left( {P_{s_t^i s_t^i }^{a_t^i } \mathop {\max \nolimits }\limits _{k\in a_t^i } Q_t^i ( {s_t^i ,k})+P_{s_t^i s_t^j }^{a_t^i } \mathop {\max \nolimits }\limits _{k\in a_t^j } Q_t^j ( {s_t^j ,k})}\right) \quad \quad \end{aligned}$$
(12)

where \(P_{s_t^i s_t^i }^{a_t^i } \) is the transition probability of an unsuccessful transmission from \(s_t^i \) (or node \(i)\) after taking action \(a_t^i \), while \(P_{s_t^i s_t^j }^{a_t^i } \) is the transition probability of a successful transmission from \(s_t^i \) to \(s_t^j \) (or node \(j)\) after taking action \(a_t^i \). The transition probabilities \(P_{s_t^i s_t^i }^{a_t^i } \) and \(P_{s_t^i s_t^j }^{a_t^i } \), which form the system model, are estimated using the historical data of each link’s successful and unsuccessful transmission rates based on the outgoing traffic of the next-hop neighbor nodes.
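
A sketch of the model-based update of Eq. (12) is shown below; the per-link success counters used to estimate the transition probabilities are a simplified assumption about how the historical data could be maintained.

```python
def estimate_success_prob(successes, attempts):
    """Estimate the link success probability from historical counts."""
    return successes / attempts if attempts else 0.5   # uninformed prior

def model_based_update(q_i, q_j, node_j, reward, successes, attempts,
                       gamma=0.8):
    """Model-based Q-update of Eq. (12) for forwarding a packet from
    node i to next-hop node j.

    q_i, q_j : {next_hop: Q-value} tables of node i and node j."""
    p_success = estimate_success_prob(successes, attempts)
    p_fail = 1.0 - p_success
    best_i = max(q_i.values()) if q_i else 0.0   # packet stays at node i
    best_j = max(q_j.values()) if q_j else 0.0   # packet reaches node j
    q_i[node_j] = reward + gamma * (p_fail * best_i + p_success * best_j)
```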

3.6 Algorithm 6: hierarchical RL model

The hierarchical RL model segregates the entire system into upper and lower levels, and applies two separate RL approaches simultaneously, one in each level, to achieve globally and locally optimal actions, respectively. Hence, a large and complex problem, such as routing [25], can be segregated into smaller problems, which can be solved simultaneously. In this way, the hierarchical RL model helps to reduce the state space, and so it improves scalability and the convergence rate.

3.6.1 Hu’s hierarchical routing scheme

Hu and Fei [25] propose a hierarchical routing scheme A(3) for clustered networks using a hierarchical RL model, and it has been shown to increase throughput P(1), as well as to reduce end-to-end delay P(2) and energy consumption P(3). The purpose is to select a next-hop node, which subsequently sends packets towards the sink node, with higher successful transmission rate.

Traditionally, in flat (or non-clustered) networks, Q-values keep track of a route cost comprised of multiple hops; however, any update of a link cost must be propagated along a route, which may be slow in large networks. Hence, in [25], a clustered network using hierarchical routing A(3) is proposed. There are two types of routing schemes, namely intra- and inter-cluster routing. The clusterhead performs inter-cluster routing to search for the best route to the sink node. The member nodes perform intra-cluster routing to search for the best route to a gateway node in the cluster. Gateway nodes are the member nodes located at the fringe of a cluster, and since they can hear from neighboring clusters, they provide inter-cluster communications. With respect to the hierarchical RL model, the upper and lower levels represent the inter-cluster and intra-cluster routing schemes, respectively. Hence, the upper and lower layers are comprised of clusterheads and member nodes, respectively. The upper layer supervises the lower layer so that member nodes search for the best route to the gateway node selected by the upper layer, while the lower layer provides evaluation feedback to the upper layer on the selection of the gateway node. This approach improves the sensitivity of a route towards changes in the network topology because an update now traverses a smaller number of hops and is confined to a particular cluster.

Table 14 shows the RL model for the scheme. Note that the model is a representation for a particular packet. The state \(s_t^i \) represents the node in which a particular packet resides (or node \(i)\). The action \(a_t^i \) represents the selection of a next-hop neighbor node \(j\); a series of actions leads the packet to the gateway nodes, rather than to the sink node as in traditional networks. Note that there are two levels of Q-learning, for the intra- and inter-cluster routing schemes respectively; Table 14 shows intra-cluster routing being implemented within each cluster. The reward representation is \(r_{t+1}^i ( {a_t^i })=-1\) and \(r_{t+1}^i ( {a_t^i })=-c\) for successful and unsuccessful packet transmissions, respectively. The negative reward indicates the resource consumption and network performance deterioration (i.e. energy consumption and packet latency) for each packet transmission.

Table 14 RL model for Hu’s hierarchical routing scheme [25]
Table 15 Performance enhancement of RL-based schemes in WSNs

4 Performance enhancements

RL has been shown to achieve the following performance enhancements as shown in Table 15:

P.1 Higher throughput. Higher throughput indicates a higher packet delivery rate, a higher successful packet transmission rate, a lower packet loss rate and a lower number of packet retransmissions.

P.2 Lower end-to-end delay/packet latency. Lower end-to-end delay (in multi-hop transmissions) and lower packet latency (in single-hop transmissions) indicate a lower number of packets in the buffer queue.

P.3 Lower energy consumption. Lower energy consumption increases network lifetime. Since each sensor node operates on battery power, energy consumption is a common performance metric. Other performance enhancements, such as higher throughput and lower end-to-end delay, may also indicate lower energy consumption due to a lower packet loss rate and a lower number of packet retransmissions.

P.4 Higher sensing coverage. Higher sensing coverage indicates that a larger part of an area is covered by at least a single sensor node, and so there is a higher rate of detection of the events of interest. A sensing coverage scheme A(5) must minimize the overlapping of sensing coverage with neighboring nodes to reduce energy consumption.

P.5 Higher route discovery rate. A higher route discovery rate indicates a higher success rate in finding a favorable route from a source node to a sink node. In [28], a favorable route must be free from malicious nodes, which drop packets received from previous hops.

P.6 Higher in-contact time. Higher in-contact time indicates a greater possibility of a sensor node discovering the presence of a mobile data collector node, as well as a longer duration for data transmission, in a sleep-wake scheduling scheme A(1.1) [12].

5 Open issues

This section discusses open issues that can be pursued in this research area.

5.1 Convergence rate and energy consumption

The convergence rate is affected by how much an agent learns, as well as by the condition of the operating environment. Generally speaking, the convergence rate increases with a longer waking or active time (or duty cycle), as well as with lower levels of dynamicity and uncertainty in the operating environment. A shorter waking time reduces energy consumption and the convergence rate, and it is suitable for operating environments with lower levels of dynamicity and uncertainty. On the other hand, a longer waking time increases the convergence rate and energy consumption, and it is suitable for operating environments with higher levels of dynamicity and uncertainty. Future research could be pursued to adjust the waking duration in order to achieve a balanced tradeoff between convergence rate and energy consumption with respect to the condition of the environment.

5.2 Enhancement on the scalability of RL

The Q-table is a two-dimensional lookup table comprised of \(\vert \text{ S }\vert \times \vert \text{ A }\vert \) entries. There are two important considerations with respect to scalability. Firstly, the number of entries grows with the product of the numbers of states and actions, and grows exponentially with the number of state and action variables; however, there may be limited memory capacity at each sensor node. Secondly, a large number of state-action pairs requires more exploration to discover most of the Q-values, which reduces the convergence rate to the optimal action and increases the energy consumption associated with learning (see Sect. 5.3). In addition to the state-action pairs, each sensor node must estimate and keep track of the state transition probability matrix \({\mathbf{T}}\) in the model-based RL model (see Sect. 3.5). Future research could be pursued to reduce the number of state-action pairs, as well as the number of state pairs in the model-based RL model, without jeopardizing the accuracy of the state and action space representations, in order to improve scalability and to reduce energy consumption.

5.3 Minimization of learning cost

The condition of the operating environment affects the usefulness and the recency of the knowledge (or Q-values). An agent must make the right decision on whether or not to learn. Generally speaking, learning should only take place if two conditions are fulfilled. Firstly, the new knowledge remains useful for a long enough period of time; that is, the operating environment remains consistent for a long enough period of time (or with sufficiently low levels of dynamicity and uncertainty). Secondly, the rewards (e.g. throughput) received must be greater than the learning costs (e.g. energy consumption and processing power). Future research could be pursued to investigate mechanisms for making the right decision on whether or not to learn, as well as for reducing the learning cost.

5.4 Enhancement on the security aspect of RL

The requirement for the sensor and sink nodes to observe and learn from the operating environment has inevitably opened new security vulnerabilities. Malicious nodes may manipulate the operating environment so that the nodes learn incorrect information. The malicious nodes may have learning capabilities and launch attacks in a cooperative manner. Consequently, the sensor nodes may converge to suboptimal actions, take longer to converge, or even fail to converge to optimal actions. This vulnerability may be particularly pronounced in the cooperative RL model (see Sect. 3.4), in which nodes that receive manipulated information may in effect become malicious themselves, since they exchange information (e.g. states and Q-values) among themselves. Consequently, manipulated nodes may make sub-optimal local actions as part of the joint action. Future research could be pursued to investigate mechanisms to address the security vulnerabilities associated with the application of RL in WSNs.

5.5 Reduction of message exchange overhead

The requirement for the sensor and sink nodes to exchange information (e.g. states and Q-values) among themselves, in order to learn from each other and from the operating environment, has inevitably increased the amount of control message exchange and energy consumption. Message exchange is essential for learning in the cooperative and hierarchical RL models (see Sects. 3.4 and 3.6, respectively); however, reducing the message exchange frequency may decrease the convergence rate to an optimal joint action. Future research could be pursued to investigate mechanisms to reduce the message exchange frequency without jeopardizing network-wide performance.

6 Conclusions

Reinforcement learning (RL) has been applied in wireless sensor networks (WSNs) to provide network performance enhancement in a wide range of schemes. To apply RL, several representations, including the state, action, and the delayed and discounted rewards, must be defined. Additionally, several features, including the Q-function, as well as the exploration and exploitation mechanism, must be defined. In the context of WSNs, this article has presented an extensive review of the enhancements of these representations and features. Most importantly, this article has presented an extensive review of a wide range of RL models and enhanced RL algorithms in the context of WSNs. The enhanced algorithms provide insights into how various schemes in WSNs can be approached using RL. The performance enhancements achieved by the traditional and enhanced RL algorithms in WSNs have also been presented. Certainly, there is a great deal of future work in the use of RL, and we have raised open issues in this article.