1 Introduction

Compared to static wired networks, the dynamicity of various properties in distributed wireless networks, including mobility patterns, wireless channels and network topology, has imposed additional challenges on achieving network performance enhancement through routing. Traditionally, routing schemes use predefined sets of policies or rules. Hence, most of these schemes have been designed with specific applications in mind; specifically, each node possesses predefined sets of policies suitable for a certain network condition. Since these policies may not be optimal under other network conditions, such schemes may fail to achieve optimal results most of the time due to the unpredictable nature of distributed wireless networks.

The application of Machine Learning (ML) algorithms to address issues associated with the dynamicity of distributed wireless networks has gained considerable attention (Forster 2007). ML algorithms help wireless nodes achieve context awareness and intelligence for network performance enhancement. Context awareness enables a wireless node to observe its local operating environment, while intelligence enables the node to learn an optimal policy, which may be dynamic in nature, for decision making in its operating environment (Yau et al. 2012). In other words, context awareness and intelligence help a wireless node to take actions based on its observed operating environment in order to achieve optimal or near-optimal network performance.

Various ML techniques, such as Reinforcement Learning (RL) (Sutton and Barto 1998), swarm intelligence (Kennedy and Eberhart 1995), genetic algorithms (Gen and Cheng 1999), and neural networks (Forster 2007; Rojas 1996), have been applied to enhance network performance. The choice of an ML algorithm may be based on the characteristics of a distributed wireless network. Examples of distributed wireless networks are Wireless Sensor Networks (WSNs) (Akyildiz et al. 2002), wireless ad hoc networks (Toh 2001), cognitive radio networks (Akyildiz et al. 2009), and delay tolerant networks (Burleigh et al. 2003). In relation to the application of ML in distributed wireless networks, Forster (2007) provides a comparison of various ML algorithms. For instance, in Forster (2007), RL is found to be more suitable for energy-constrained WSNs than the swarm intelligence approach. The rationale is that swarm intelligence usually incurs higher network overheads than RL, and hence may consume more energy. Meanwhile, genetic algorithms may be more suitable for centralized wireless networks because they require global information (Forster 2007).

In distributed wireless networks, routing is a core component that enables a source node to find an optimal route to its destination node. Route selection may depend on the characteristics of the operating environment. Hence, the application of ML in routing schemes to achieve context awareness and intelligence has received considerable research attention. For instance, in mobile networks, ML-based routing schemes are adaptive to the operating environment because it is impractical for network designers to establish routing policies for every possible movement pattern when the characteristics and parameters of the operating environment change with time and location (Ouzecki and Jevtic 2010; Chang et al. 2004).

This article provides an extensive survey on the application of various RL approaches to routing in distributed wireless networks. Our contributions are as follows. Section 2 presents an overview of RL. Section 3 presents an overview of routing in distributed wireless networks from the perspective of RL. Section 4 presents RL models for routing. Section 5 presents new RL features for routing. Section 6 presents an extensive survey on the application of RL to routing. Section 7 presents an implementation of an RL-based routing scheme on a wireless platform. Section 8 presents the performance enhancements brought about by the application of RL in various routing schemes. Section 9 presents open issues. Finally, we provide conclusions. All discussions are presented in a tutorial manner in order to establish a foundation and to spark new interest in this research field.

2 Reinforcement learning

Reinforcement Learning (RL) is a biologically inspired ML approach that acquires knowledge by exploring the local operating environment without the need for external supervision (Xia et al. 2009; Santhi et al. 2011). A learner (or agent) explores the operating environment itself, and learns the optimal actions based on a trial-and-error concept. RL has been applied to keep track of the relevant factors that affect the decision making of agents (Sutton and Barto 1998). In distributed wireless networks, RL has been applied to model the main goal(s), particularly network performance metric(s) such as end-to-end delay, rather than to model all the relevant factors in the operating environment that affect the performance metric(s) of interest. Through learning the optimal policy on the fly, the goal of the agent is to maximize the long-term rewards in order to achieve performance enhancement (Ouzecki and Jevtic 2010).

An RL task that fulfills the Markovian (or memoryless) property is called a Markov Decision Process (MDP) (Sutton and Barto 1998). The Markovian property implies that the action selection of an agent at time \(t\) depends on the state-action pair at time \(t-1\) only, rather than the past history at times \(t-2, t-3, {\ldots }\). In MDP, an agent is modeled as a four-tuple \(\{S, A, T, R\}\), where \(S\) is a set of states, \(A\) is a set of actions, \(T\) is a state transition probability matrix that represents the probability of switching from one state at time \(t\) to another state at time \(t+1\), and \(R\) is a reward function that represents a reward (or cost) \(r\) received from the operating environment. At time \(t\), an agent observes state \(s\in S\) and chooses action \(a\in A\) based on its knowledge (or learned optimal policy). At time \(t+1\), the agent receives a reward \(r\). As time goes by, the agent learns and associates each state-action pair with a reward. In other words, the reward indicates the appropriateness of taking action \(a\in A\) in state \(s\in S\). Note that MDP requires an agent to construct and keep track of a model of its dynamic operating environment in order to estimate the state transition probability matrix \(T\). By omitting \(T\), RL learns knowledge through constant interaction with the operating environment.

Q-learning is a popular RL approach, and it has been widely applied in distributed wireless networks (Yau et al. 2012). In Q-learning, an agent is modeled as a three-tuple \(\{S, A, R\}\) as described below:

  • State. An agent has a set of states \(S\) that represent the decision-making factors observed from its local operating environment. At any time instant \(t\), agent \(i\) observes state \(s_t^i \in S\). The state can be internal, such as the buffer occupancy rate, or external, such as a destination node. The agent observes its state in order to learn about its operating environment. If the state is partially observable (i.e. the operating environment is noisy), an agent can estimate its state, which is commonly called the belief state, using a Partially Observable Markov Decision Process (POMDP) (Sutton and Barto 1998).

  • Action. An agent has a set of available actions \(A\). Examples of actions are data transmission and next-hop node selection. Based on the continuous observation and interaction with the local operating environment, an agent \(i\) learns to select an action \(a_t^i \in A\) that maximizes its current and future rewards.

  • Reward. Whenever an agent \(i\) carries out an action \(a_t^i \in A\), it receives a reward \(r_{t+1}^i (s_{t+1}^i )\) from the operating environment. A reward \(r_{t+1}^i (s_{t+1}^i )\) may represent a performance metric, such as transmission delay, throughput, or channel congestion level. A weight factor may be used to estimate the reward if there are two or more different types of performance metrics. For instance, in Dong et al. (2007), \(r_{t+1}^i ( {s_{t+1}^i } )=\omega r_{a,t+1}^i ( {s_{t+1}^i } )+(1-\omega )r_{b,t+1}^i ( {s_{t+1}^i } )\), where \(r_{a,t+1}^i ( {s_{t+1}^i } )\) and \(r_{b,t+1}^i ( {s_{t+1}^i } )\) indicate the rewards for two different performance metrics, and \(\omega \) indicates the weight factor. There are two types of rewards, namely delayed rewards and discounted rewards (or accumulated and estimated future rewards). Given an action taken at time \(t\), the delayed reward represents the reward received from the operating environment at time \(t+1\), whereas the discounted reward is the accumulated reward expected to be received from the operating environment in the long run at times \(t+1, t+2, \ldots \). The agent aims to learn how to maximize its total rewards comprised of delayed and discounted rewards (Sutton and Barto 1998).

2.1 Q-learning model

Q-learning defines a Q-function \(Q_t^i (s_t^i ,a_t^i )\), which is also called a state-action function. The Q-function estimates Q-values, which are the long-term rewards that an agent can expect to receive for each possible action \( a_t^i \in A\) taken in state \(s_t^i \in S \). An agent \(i\) maintains a Q-table that keeps track of Q-values for each possible state-action pair, so there are \(|S|\times |A|\) entries. Subsequently, based on these Q-values, the agent derives an optimal policy \(\pi \) that defines the best-known action \(a_t^i \), which has the maximum Q-value for each state \(s_t^i\). For each state-action pair (\(s_t^i, a_t^i\)) at time \(t\), the Q-value is updated using Q-function as follows:

$$\begin{aligned} Q_{t+1}^i \left(s_t^i ,a_t^i \right)\leftarrow \left( {1-\alpha } \right)Q_t^i \left(s_t^i ,a_t^i \right)+\alpha \left[ {r_{t+1}^i \left(s_{t+1}^i \right)+\gamma \mathop {\max }\limits _{a\in A} Q_t^i \left( {s_{t+1}^i ,a} \right)} \right] \end{aligned}$$
(1)

where \(0\le \alpha \le 1\) is the learning rate, and \(0\le \gamma \le 1\) is the discount factor. A higher learning rate \(\alpha \) indicates a higher speed of learning, and it is normally dependent on the level of dynamicity in the operating environment. Note that too high a learning rate may cause fluctuations in Q-values. If \(\alpha =1\), the agent relies solely on its newly estimated Q-value \(r_{t+1}^i (s_{t+1}^i )+\gamma \mathop {\max }\limits _{a\in A} Q_t^i ( {s_{t+1}^i ,a})\), and forgets its current Q-value \(Q_t^i (s_t^i ,a_t^i )\). On the other hand, \(\gamma \) enables the agent to adjust its preference for long-term future rewards. Unless \(\gamma =1\), in which case both delayed and discounted rewards are given the same weight, the agent always gives more preference to the delayed reward.
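
To make Eq. (1) concrete, the following minimal Python sketch applies the update to a Q-table stored as a dictionary; the function and variable names are illustrative rather than taken from any particular scheme.

```python
from collections import defaultdict

def update_q(Q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One Q-learning update following Eq. (1).

    Q          : dict mapping (state, action) -> Q-value
    reward     : delayed reward observed after taking `action` in `state`
    next_state : state observed at time t+1
    actions    : set of actions available in next_state
    """
    # Estimate of the discounted future reward from the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    # Blend the old estimate with the new sample using the learning rate alpha.
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + \
                         alpha * (reward + gamma * best_next)

# Example: Q-table initialized to zero for all state-action pairs.
Q = defaultdict(float)
update_q(Q, state="s0", action="a1", reward=1.0, next_state="s1",
         actions={"a1", "a2"})
```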

2.2 Action selection: exploitation or exploration

During action selection, there are two types of actions, namely, exploitation and exploration. Exploitation selects the best-known action \( a_t^i =\text{ argmax}_{a\in A} Q_t^i ( {s_t^i ,a})\), which has the highest Q-value, in order to improve network performance. Exploration selects a random action \(a_t^i \in A\) in order to improve knowledge, specifically, the estimation of the Q-values for all state-action pairs. A well-balanced tradeoff between exploitation and exploration helps to maximize accumulated rewards as time goes by. This tradeoff mainly depends on the accuracy of the Q-value estimation, and the level of dynamicity of the operating environment (Yau et al. 2012).

Upon convergence of Q-values, exploitation may be given higher priority because exploration may not discover better actions. A popular tradeoff mechanism is the \(\varepsilon \)-greedy approach in which the agent performs exploration with a small probability \(\varepsilon \) (e.g. \(\varepsilon =0.1)\) and exploitation with probability \(1-\varepsilon \).
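
A minimal sketch of \(\varepsilon \)-greedy selection over a Q-table stored as a dictionary is shown below; the names are illustrative.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Q is a dict (e.g. defaultdict(float)) mapping (state, action) -> Q-value.
    Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(actions))           # exploration
    return max(actions, key=lambda a: Q[(state, a)])  # exploitation
```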

The \(\varepsilon \)-greedy approach may not be suitable in some scenarios because exploration selects non-optimal actions randomly with equal probability; hence the worst action with the lowest Q-value may be chosen (Sutton and Barto 1998). A popular softmax approach based on the Boltzmann distribution has been applied to choose non-optimal actions for exploration, with actions with higher Q-values given higher priorities (Sutton and Barto 1998). For instance, in Dowling et al. (2005), a node \(i\) chooses its next-hop neighbor node \(a_t^i \in A\) using the Boltzmann distribution with probability:

$$\begin{aligned} P\left( {s_t^i ,a_t^i } \right)= \frac{e^{-Q_t^i \left( {s_t^i ,a_t^i } \right)/T}}{\mathop \sum \nolimits _{a \in A} e^{-Q_t^i \left( {s_t^i ,a} \right)/T}} \end{aligned}$$
(2)

where \(A\) represents a set of node \(i\)’s neighbor nodes; and \(T\) is the temperature factor that determines the level of exploration. Higher \(T\) value indicates higher possibility of exploring non-optimal routes, whereas lower \(T\) value indicates higher possibility of exploiting optimal routes.
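
A minimal sketch of Boltzmann (softmax) selection follows. The `costs` flag is an assumption added for illustration: with `costs=True` the exponent is negated as in Eq. (2), so that lower Q-values (e.g. costs such as delay) receive higher probabilities; with `costs=False` higher Q-values are preferred.

```python
import math
import random

def boltzmann_select(Q, state, actions, temperature=1.0, costs=False):
    """Softmax action selection over the Q-values of `actions` in `state`.
    Higher temperature -> more exploration; lower temperature -> more exploitation."""
    acts = list(actions)
    sign = -1.0 if costs else 1.0
    weights = [math.exp(sign * Q[(state, a)] / temperature) for a in acts]
    # Sample one action with probability proportional to its Boltzmann weight.
    return random.choices(acts, weights=weights, k=1)[0]
```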

In Bhorkar et al. (2012), a node \(i\) chooses its routing decision \(a_t^i \) based on the historical information as follows:

$$\begin{aligned} \varepsilon \left(s_t^i \right)=\frac{1}{c_t^i \left( {s_t^i } \right)+1} \end{aligned}$$
(3)

where \(c_t^i ({s_t^i })\) is a counter that represents the number of successful packet transmissions from node \(i\) to next-hop neighbor node \(s_t^i \in S^{i}\) up to time \(t\). Subsequently, with probability \(1-\varepsilon (s_t^i )\), node \(i\) chooses its routing decision \(a_t^i =\text{ argmax}_{a \in A(s_t^i )} Q_t^i ( {s_t^i , a})\); while with the smaller probability \(\varepsilon (s_t^i )\), node \(i\) chooses a routing decision \(a_t^i \in A(s_t^i )\) uniformly at random, i.e. each with probability \(\varepsilon (s_t^i )/|A({s_t^i })|\).
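
A sketch of the counter-based exploration of Eq. (3); `success_count` and the other names are illustrative.

```python
import random

def select_next_hop(Q, state, candidates, success_count):
    """Counter-based exploration in the spirit of Eq. (3):
    epsilon decays as the number of successful transmissions grows."""
    epsilon = 1.0 / (success_count[state] + 1)
    if random.random() < epsilon:
        # Explore: each candidate chosen with probability epsilon / |A|.
        return random.choice(list(candidates))
    # Exploit: candidate with the highest Q-value.
    return max(candidates, key=lambda a: Q[(state, a)])
```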

In Liang et al. (2008), a node \(i\) adjusts its level of exploration according to the level of dynamicity in the operating environment, particularly node mobility. The node computes the exploration probability as follows:

$$\begin{aligned} \varepsilon _i =\frac{n_i^{a,T} +n_i^{d,T} }{n_i^T } \end{aligned}$$
(4)

where \(n_i^{a,T} \) and \(n_i^{d,T} \) are the numbers of nodes that appear and disappear within node \(i\)’s transmission range during a time window \(T\), respectively, and \(n_i^T \) is the number of node \(i\)’s neighbor nodes. Hence, a higher \(\varepsilon _i \) indicates a highly mobile network, and so RL requires more exploration.
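
A sketch of Eq. (4), in which the exploration probability grows with the fraction of neighbors that appeared or disappeared during the last time window; the clamping to [0, 1] and the handling of an empty neighbor set are added assumptions.

```python
def mobility_exploration_prob(appeared, disappeared, num_neighbors):
    """Exploration probability per Eq. (4): (n_appear + n_disappear) / n_neighbors."""
    if num_neighbors == 0:
        return 1.0  # no known neighbors: explore (assumption, not from the source)
    return min(1.0, (appeared + disappeared) / num_neighbors)
```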

2.3 Q-learning algorithm

Figure 1 shows the traditional Q-learning algorithm presented in Sects. 2.1 and 2.2.

Fig. 1 The traditional Q-learning algorithm at agent \(i\)
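
The traditional Q-learning algorithm of Sects. 2.1 and 2.2 can be sketched as follows; the environment interface (`reset`, `step`) and the episode structure are placeholder assumptions for illustration.

```python
import random
from collections import defaultdict

def q_learning_agent(env, actions, episodes=100, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Traditional Q-learning at a single agent (Sects. 2.1 and 2.2).

    `env` is a placeholder assumed to expose reset() -> state and
    step(state, action) -> (reward, next_state); it is not part of the survey.
    """
    Q = defaultdict(float)                       # Q-table over (state, action)
    for _ in range(episodes):
        state = env.reset()
        for _ in range(1000):                    # bounded episode length
            # Action selection: explore with probability epsilon, else exploit.
            if random.random() < epsilon:
                action = random.choice(list(actions))
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            reward, next_state = env.step(state, action)
            # Q-value update following Eq. (1).
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] = (1 - alpha) * Q[(state, action)] + \
                                 alpha * (reward + gamma * best_next)
            state = next_state
    return Q
```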

3 Routing in distributed wireless networks

Routing is a key component in distributed wireless networks that enables a source node to search for and establish route(s) to a destination node through a set of intermediate nodes. The objectives of a routing scheme depend mainly on the type of operating environment and the underlying network, particularly its characteristics and requirements. This section reviews the concepts of routing in various types of distributed wireless networks, and the advantages brought about by RL to routing. Section 3.1 reviews several major types of distributed wireless networks with respect to routing, particularly their network characteristics, routing challenges, and the advantages brought about by RL. Section 3.2 provides an overview of the application of RL to routing in distributed wireless networks, and a general formulation of the routing problem using RL.

3.1 Types of distributed wireless networks

This section presents four types of distributed wireless networks, namely wireless ad hoc networks, wireless sensor networks, cognitive radio networks, and delay tolerant networks. Table 1 summarizes each type of these networks.

Table 1 Characteristics of distributed wireless networks

3.1.1 Wireless ad hoc networks

A wireless ad hoc network is comprised of self-configuring static or mobile nodes. Two main types of wireless ad hoc networks are static ad hoc networks and Mobile Ad hoc NETworks (MANETs) (Toh 2001; Boukerche 2009).

Figure 2 shows a wireless ad hoc network scenario. Nodes within the range of each other (e.g. A and B) may communicate directly; and out-of-range nodes (e.g. A and F) may use a routing scheme to search for a route, comprised of intermediate nodes, from node (A) to node (F). The routing scheme uses a cost metric to compute the best possible route, such as the shortest route and route with the lowest end-to-end delay.

Fig. 2 Wireless ad hoc network scenario

The main routing challenge in wireless ad hoc networks is the dynamic topology caused by node mobility. For instance, in Fig. 2, source node (A) establishes a route (A-H-G-F) to destination node (F). Suppose node (H) moves out of range of node (A), resulting in link breakage; node (A) must then search for another route to node (F). The packet end-to-end delay and packet loss rate depend on the effectiveness of the routing scheme. Furthermore, in link-state routing schemes, such as Optimized Link State Routing (OLSR) (Clausen and Jacquet 2003), each node maintains a route to every other node in the network. In highly mobile networks, OLSR constantly updates these routes due to link breakages, causing high computing cost and routing overhead.

RL-based routing schemes have been shown to be highly adaptive to topology changes (Forster 2007). For example, RL enables a node to observe its neighbor nodes’ mobility characteristics, and to learn how to improve the end-to-end delay and throughput performances of routes. Subsequently, the node selects a next-hop node that can satisfy the Quality of Service (QoS) requirements imposed on the route.

3.1.2 Wireless sensor networks

Wireless Sensor Networks (WSNs) are comprised of sensor nodes with sensing, computing, storing, and short-range wireless communication capabilities commonly used for monitoring the operating environment. WSNs share similar characteristics with wireless ad hoc networks in that both are multi-hop networks. An intrinsic characteristic of WSNs is that the sensor nodes are highly energy constrained with limited processing capability (Akyildiz et al. 2002).

In WSNs, there is a special gateway called a sink node as shown in Fig. 3. The sink node monitors a WSN by sending control messages to the sensor nodes, and gathers sensing outcomes from them.

Fig. 3 WSN scenario

The main routing challenge in WSNs is the need to reduce energy consumption and computational cost at sensor nodes in order to prolong network lifetime. For instance, routing schemes for WSNs must avoid frequent flooding of routing information in order to reduce energy consumption.

Since RL incurs low computational cost and routing overhead (Forster 2007), RL-based routing is suitable for WSNs. For instance, RL enables a sensor node to observe and estimate energy consumption of nodes along a route based on local observations, so that each node can perform load-balancing and select routes with higher residual energy in order to prolong network lifetime.

3.1.3 Cognitive radio networks

Cognitive Radio (CR) is a next-generation wireless communication technology that addresses issues associated with the efficiency of spectrum utilization (Akyildiz et al. 2009). In CR Networks (CRNs), unlicensed users (or Secondary Users, SUs) exploit and use underutilized licensed channels. A distributed CRN shares similar characteristics with wireless ad hoc networks in that both are multi-hop networks. An intrinsic characteristic of CRNs is that the SUs must prevent harmful interference to the licensed users (or Primary Users, PUs), who own the channels. Since the SUs must vacate their channels whenever PU activity appears, channel availability is dynamic in nature.

The main routing challenge in CRNs is that, since SUs must adapt to dynamic changes in spectrum availability, routing in CRNs must be spectrum-aware (Al-Rawi and Yau 2012). Figure 4 shows a CRN scenario co-located with three PU Base Stations (BSs). Suppose SU (A) wants to establish a route to the SU BS. A traditional routing algorithm may provide the route with the minimum number of hops (A–C–E–G) to the SU BS. However, the SUs may suffer from poor network performance because the route passes through three PU BSs and their hosts (B, D, F, H), resulting in harmful interference to the PUs. On the other hand, CR-based spectrum-aware routing may provide a route with a higher number of hops (A–C–I–K–L) that generates less interference to the PUs and their hosts (B, D, J), and so it provides better end-to-end SU performance.

Fig. 4 CRN scenario

RL has been shown to improve network performance of CRNs. For instance, based on the local observations and received rewards, a SU node may learn the behavior and channel utilization of PUs. Subsequently, the SU selects a route that reduces end-to-end interference to PUs.

3.1.4 Delay tolerant networks

Delay Tolerant Networks (DTNs) interconnect highly heterogeneous nodes (Burleigh et al. 2003), and several common assumptions adopted by traditional networks are relaxed. Examples of these assumptions are low end-to-end delay, low error rate, and the availability of end-to-end routes throughout the entire network. These assumptions may not be practical in DTNs due to various challenges, such as high node mobility, low-quality wireless links with small coverage, and low residual energy.

The relaxation of these assumptions has imposed new challenges on routing. As an example, due to the lack of end-to-end routes between any two nodes most of the time, reactive and proactive routing schemes may not be applicable to DTNs (Elwhishi et al. 2010). Using the traditional Ad hoc On-Demand Distance Vector (AODV) (Perkins and Royer 1999) routing scheme, routes may need to be constantly re-established due to frequent link breakages, which increases energy consumption and routing overhead. Hence, routing in DTNs may need to follow a “store-and-forward” approach in which a node buffers its data packets until a link between itself and a next-hop node becomes available (Elwhishi et al. 2010).

RL-based routing schemes have been shown to be adaptive to link changes by choosing a next-hop node based on various local states (or conditions) that affect a link’s availability. For instance, without using the global information, a node may observe and learn about its local operating environment, such as link congestion level and buffer utilization level of the next-hop node, so that it selects a route with higher link availability and lower buffer utilization in order to increase packet delivery rate (Elwhishi et al. 2010).

3.2 RL in the context of routing in distributed wireless networks

The RL-based routing schemes have seen most of their applications in four types of distributed wireless networks (see Sect. 3.1), namely ad hoc networks, wireless sensor networks, cognitive radio networks, and delay tolerant networks. Table 2 shows the application of RL to routing in various distributed wireless networks.

Table 2 RL-based routing schemes in various distributed wireless networks

Routing in distributed wireless networks has been approached using RL so that each node makes local decisions, with regard to next-hop or link selection as part of a route, in order to optimize network performance. In routing, the RL approach enables a node to:

  a. estimate the dynamic link cost. This characteristic allows a node to learn about and adapt to its dynamic local operating environment.

  b. search for the best possible route using information observed from the local operating environment only.

  c. incorporate a wide range of factors that affect the routing performance.

Table 3 shows a widely used RL model for routing schemes in distributed wireless networks. The state represents all the possible destination nodes in the network. The action represents all the possible next-hop neighbor nodes, which may be selected to relay data packets to a given destination node. Each link within a route may be associated with different types of dynamic costs (Nurmi 2007), such as queuing delay, available bandwidth or congestion level, packet loss rate, energy consumption level, link reliability, and changes in network topology resulting from irregular node movement speeds and directions. Based on its objectives, a routing scheme computes its Q-values, which estimate the short-term and long-term rewards (or costs) received (or incurred) in transmitting packets along a route to a given destination node. Examples of rewards are throughput and packet delivery rate; an example of cost is end-to-end delay. To maximize (minimize) the accumulated rewards (costs), an agent chooses an action with the maximum (minimum) Q-value.

Table 3 A widely used RL model of agent \(i\) for a routing scheme

Figure 5 shows how an RL model (see Table 3) can be incorporated into routing. All possible states are \(S=\{1, 2, 3, 4, 5, 6\}\). All possible actions of node \(i=1\) are \(a_t^i \in A=\{2, 3, 4\}\). Suppose node \(i=1\) wants to establish a route to node \(s_t^i =6\), and the objective of the routing scheme is to find a route that provides the highest throughput performance. Node \(i\) may choose to send its packets to its neighbor node \(a_t^i =2\), and receives a reward from node \(a_t^i =2\) that estimates the throughput achieved by the route from upstream node \(a_t^i =2\) to destination node \(s_t^i =6\), specifically route (2–6). Subsequently, node \(i\) updates its Q-value for state \(s_t^i =6\) via action \(a_t^i =2\) using the Q-function (see Eq. 1), specifically \(Q_t^i \left( {s_t^i =6,a_t^i =2} \right)\). Likewise, when node \(i\) sends its data packets to node \(a_t^i =3\), it receives a reward that estimates the throughput from upstream node \(a_t^i =3\), achieved via either route (3–2–6) or route (3–5–6), and updates \(Q_t^i \left( {s_t^i =6,a_t^i =3} \right)\). Note that whether route (3–2–6) or (3–5–6) is chosen by upstream node \(a_t^i =3\) depends on the Q-values \(Q_t^{i=3} \left( {s_t^{i=3} =6,a_t^{i=3} =2} \right)\) and \(Q_t^{i=3} \left( {s_t^{i=3} =6,a_t^{i=3} =5} \right)\) at node \(a_t^i =3\); the upstream node that provides the maximum Q-value is the exploitation action. The matrix in Fig. 5 shows an example of the routing table (or Q-table) being constantly updated at node \(i\). Node \(i\) keeps track of the Q-values of all possible destinations through its next-hop neighbor nodes in its Q-table.

Fig. 5 RL-based routing scenario
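
As an illustration of the Q-table in Fig. 5, the following sketch maintains Q-values indexed by (destination, next-hop) pairs at a node and returns the exploitation next hop for a given destination; the neighbor set and values are illustrative.

```python
from collections import defaultdict

# Q-table at node i=1: Q[(destination, next_hop)] -> estimated long-term reward
# (e.g. throughput) of reaching `destination` via `next_hop`.
Q = defaultdict(float)
neighbors = [2, 3, 4]          # next-hop neighbor nodes of node 1 (illustrative)

def best_next_hop(destination):
    """Exploitation: choose the neighbor with the maximum Q-value
    towards the given destination."""
    return max(neighbors, key=lambda j: Q[(destination, j)])

# Example: after some learning, rewards received from neighbors have been
# folded into the Q-values (the values below are purely illustrative).
Q[(6, 2)], Q[(6, 3)], Q[(6, 4)] = 0.8, 0.5, 0.3
print(best_next_hop(6))        # -> 2
```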

4 Reinforcement learning models for routing

RL models have been applied to routing schemes in various distributed wireless networks. The RL models are: Q-routing, Multi-Agent Reinforcement Learning (MARL), and Partially Observable Markov Decision Process (POMDP). The rest of this section discusses the RL models.

4.1 Q-routing model

Boyan and Littman (1994) propose Q-routing, which is based on the traditional Q-learning model (Sutton and Barto 1998). In Q-routing, a node chooses a next-hop node, which has the minimum end-to-end delay, in order to mitigate link congestion (Ouzecki and Jevtic 2010; Chang et al. 2004). The traditional Q-routing approach has also been adopted as a general approach to improve network performance in Zhang and Fromherz (2006).

In Q-routing, the state \(s_t^i \) represents a destination node in the network. The action \(a_t^i \) represents the selection of a next-hop neighbor node to relay data to a destination node \(s_t^i \). Each link of a route is associated with a dynamic delay cost comprised of queuing and transmission delays. Subsequently, for each state-action pair (or destination and next-hop neighbor node pair), a node computes its Q-value, which estimates the end-to-end delay for transmitting packets along a route to a destination node \(s_t^i \). Specifically, at time instant \( t+1\), a particular node \(i\) updates its Q-value \(Q_t^i (s_t^i ,j)\) to a destination node \(s_t^i \) via a next-hop neighbor node \(a_t^i =j\). Hence, Eq. (1) is rewritten as follows:

$$\begin{aligned} Q_{t+1}^i \left(s_t^i ,j\right)\leftarrow \left( {1-\alpha } \right)Q_t^i \left(s_t^i ,j\right)+\alpha \left[ {r_{t+1}^i \left(s_{t+1}^i ,j\right)+\mathop {\text{ min}}\limits _{k\in a_t^j } Q_t^j \left( {s_t^j ,k} \right)} \right] \end{aligned}$$
(5)

where \(0\le \alpha \le 1\) is the learning rate; \(r_{t+1}^i ( {s_{t+1}^i ,\,j})=d_{qu,t+1}^i +d_{tr, t+1}^{i,j} \) represents two types of delays, specifically \(d_{qu,t+1}^i \) is the queuing delay at node \(i\), and \(d_{tr, t+1}^{i,j} \) is the transmission delay between node \(i\) and its next-hop neighbor node \(j\); and \(Q_t^j ({s_t^j ,k})\), which is a Q-value received from next-hop neighbor node \(j\), is the estimated end-to-end delay along the route from node \(j\)’s next-hop neighbor node \(k \in a_t^j \) to the destination node.

Referring to Fig. 5, suppose node \(i=1 \)wants to establish a route to node \(s_t^i =6\) using the Q-routing model. Node \(i=1\) may choose its next-hop neighbor node \(a_t^i =j=2\) to forward its data packets. When node \(i\) sends its data packets to node \(j=2\), it receives from neighbor node \(j=2\) an estimate of \(\text{ min}_{k\in a_t^j } Q_t^j ({s_t^j ,k})\) that represents the estimated minimum end-to-end delay from node \(j\) to destination node\( s_t^i \). Node \(i=1\) also measures its queuing delay \(d_{qu,t+1}^i \) and transmission delay \(d_{tr, t+1}^{i,j} \) between itself and neighbor node \(j=2\). Subsequently, using Eq. (5), node \(i\) updates its Q-value, \(Q_t^i (s_t^i =6,a_t^i =2)\) that represents the end-to-end delay from itself to destination node \(s_t^i =6\) through the chosen next-hop neighbor node \(a_t^i =2\).
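
The update of Eq. (5), together with the worked example above, can be sketched as follows; the measured delays and the neighbor's feedback value are assumed to be provided by the underlying protocol, and all names are illustrative.

```python
from collections import defaultdict

def q_routing_update(Q, dest, next_hop, queue_delay, tx_delay,
                     neighbor_min_q, alpha=0.5):
    """Eq. (5): estimated end-to-end delay to `dest` via `next_hop`.

    queue_delay    : queuing delay measured at this node
    tx_delay       : transmission delay to `next_hop`
    neighbor_min_q : min_k Q^j(dest, k) reported back by `next_hop`
    """
    sample = (queue_delay + tx_delay) + neighbor_min_q
    Q[(dest, next_hop)] = (1 - alpha) * Q[(dest, next_hop)] + alpha * sample

# Example (cf. Fig. 5): node 1 forwards to neighbor 2 towards destination 6.
Q = defaultdict(float)
q_routing_update(Q, dest=6, next_hop=2, queue_delay=2.0, tx_delay=1.5,
                 neighbor_min_q=4.0)
```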

4.2 Multi-agent reinforcement learning model

The traditional RL model, which is greedy in nature, provides local optimization regardless of the global performance, and so it is not sufficient to achieve global optimization or network-wide QoS provisioning. This can be explained as follows: since nodes share a common operating environment in wireless networks, a node’s neighbor nodes may take actions that affect its own performance due to channel contention. In Multi-Agent Reinforcement Learning (MARL), in addition to learning locally using the traditional RL model, each node exchanges locally observed information with neighboring nodes through collaboration in order to achieve global optimization. This helps the nodes to consider not only their own performance, but also that of others. Hence, the MARL model extends the traditional RL model by fostering collaboration among neighboring nodes so that a system-wide optimization problem can be decomposed into a set of distributed problems solved by individual nodes in a distributed manner.

Referring to Fig. 5, node \(i=1\) constantly exchanges knowledge (i.e. Q-values and rewards) with neighbor nodes \(j = 2, 3\) and 4. As an example, in Dowling et al. (2005), the MARL-based routing scheme addresses a routing challenge in which a node \(i\) selects its next-hop neighbor node \(j\) with the objective of increasing network throughput and packet delivery rate in a heterogeneous mobile ad hoc network (see Sect. 3.1.1). Each node may possess different capabilities in solving the routing problem in a heterogeneous environment. Hence, the nodes share their knowledge (i.e. route cost) through message exchange. The exchanged route cost is subsequently applied by a node \(i\) to update its Q-values so that an action, which maximizes the rewards of itself and its neighboring nodes, is chosen.

4.3 Partially observable Markov decision process model

The Partially Observable Markov Decision Process (POMDP) model extends the Q-routing and multi-agent RL models. In the POMDP-based routing model, a node is not able to clearly observe its operating environment. Since the state is unknown, the node must estimate it. For instance, the state of a node may incorporate its next-hop neighbor node’s local parameters, such as its forwarding selfishness, residual energy, and congestion level.

As an example, in Nurmi (2007), the routing scheme addresses a routing challenge in which a node \(i\) selects its next-hop neighbor node \(j\) with the objective of minimizing energy consumption. Node \(j\)’s forwarding decision, which is based on its local parameters (or states) such as the forwarding selfishness, residual energy and congestion level, is unclear to node \(i\). Additionally, node \(j\)’s decision is stochastic in nature. Hence, node \(j\)’s information is unknown to node \(i\). The routing scheme is formulated as a POMDP problem in which node \(i\) estimates the probability distribution of the local parameters based on its previous estimation and its observed historical actions \(H_t^j\) of node \(j\).

5 New features

This section presents the new features that have been incorporated into the traditional RL-based routing models in order to further enhance network performance.

5.1 Achieving balance between exploitation and exploration

Exploitation enables an agent to select the best-known action(s) in order to improve network performance, while exploration enables an agent to explore random actions in order to improve the estimation of the Q-value for each state-action pair. A well-balanced tradeoff between exploitation and exploration in routing is important to maximize accumulated rewards as time goes by.

Hao and Wang (2006) propose a Bayesian approach to exploration. The Bayesian approach (Dearden et al. 1999) constantly updates a belief state (also called an estimated state) based on historical observations of the state in order to address the uncertainty of MDPs (see Sect. 2). This approach estimates two values. Firstly, it estimates the expected future Q-values, \(E [ {Q_t^i \left( {s_t^i , a_t^i } \right)}]\), based on a constantly updated model of transition probabilities and reward functions. Secondly, it estimates the expected reward \(E[ {r_t^i ( {s_t^i , a_t^i } )} ]\) for choosing an exploration action. This approach chooses the next action with the maximum value of \(E [ {Q_t^i ( {s_t^i , a_t^i } )} ]+E[ {r_t^i ( {s_t^i , a_t^i } )} ]\) in order to make a balanced tradeoff between exploitation and exploration. This approach has been shown to provide higher accumulated rewards compared to the traditional Q-learning approach.

In Forster and Murphy (2007), routes are assigned exploration probabilities such that routes with lower costs are initially assigned higher exploration probabilities. Subsequently, during the learning process, the exploration probability of a route is adjusted by a constant factor \(f\) based on its selection frequency and the received rewards. There are three types of received rewards, namely positive, negative, and neutral rewards. Positive and negative rewards are received whenever there are changes to the operating environment, and neutral rewards are received when learning has achieved convergence. For instance, for a route via next-hop node \(j\), its exploration probability is decreased by a value of \(f\) each time it is selected. This scheme has been shown to provide a higher convergence rate compared to traditional uniform exploration methods.

Fu et al. (2005) adopt a genetic-based approach for exploration. In the traditional RL-based routing scheme, data packets are routed using exploitation actions regardless of their respective service classes. Consequently, in this scheme, exploration is adjusted using a genetic-based approach in order to discover routes based on the QoS requirement(s) of packets (e.g. throughput and end-to-end delay). Based on the genetic algorithm, each gene represents a route between a source and destination node pair. The length of a chromosome changes with the dynamicity of the operating environment. The fitness of a route is based on the delay and throughput of the route. Subsequently, routes are ranked and selected based on their fitness.

5.2 Achieving higher convergence rate

Convergence to an optimal policy can be achieved after some learning time. Nevertheless, the speed of convergence is unpredictable and may depend on the dynamic operating environment. Traditionally, the learning rate \(\alpha \) is used to adjust the speed of convergence. A higher learning rate may increase the convergence speed; however, the Q-values may fluctuate, particularly when the dynamicity of the operating environment is high, because a Q-value then depends more on recent estimates than on previous experience.

Nurmi (2007) applies a stochastic learning algorithm, namely Win-or-Learn-Fast Policy Hill Climbing (WoLF-PHC) (Bowling and Veloso 2002), to adjust the learning rate dynamically based on the dynamicity of the operating environment in distributed wireless networks. The algorithm defines Winning and Losing as receiving higher and lower rewards than its expectations, respectively. When the algorithm is winning, the learning rate is set to a lower value, and vice-versa. The reason is that, when a node is winning, it should be cautious in changing its policy because more time should be given to the other nodes to adjust their own policies in favor of this winning. On the other hand, when a node is losing, it should adapt faster to any changes in the operating environment because its performance (or rewards) is lower than expected.
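
The win-or-learn-fast idea can be sketched as a simple switch between two learning rates depending on whether the received reward exceeds the agent's current expectation; this is a simplified reading for illustration (WoLF-PHC itself adjusts policy learning rates), and the names are illustrative.

```python
def wolf_learning_rate(reward, expected_reward, alpha_win=0.1, alpha_lose=0.5):
    """Learn slowly when 'winning' (reward at or above expectation),
    learn fast when 'losing' (reward below expectation)."""
    return alpha_win if reward >= expected_reward else alpha_lose
```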

Kumar and Miikkulainen (1997) apply a dual RL-based approach to speed up the convergence rate by updating the Q-values with respect to the previous state (i.e. source node) and the next state (i.e. destination node) simultaneously, although this may increase the routing overhead. The traditional Q-routing model updates the Q-values with respect to the destination node only (see Sect. 4.1). On the other hand, the dual RL-based Q-routing model updates Q-values with respect to both destination and source nodes. Since the dual RL-based approach updates the Q-values of a route in both directions, it enables nodes along a route to make next-hop selection decisions towards both source and destination nodes while increasing the speed of convergence.

Hu and Fei (2010) use a system model to estimate the converged Q-values of all actions so that it is not necessary to update Q-values only after taking the corresponding actions. This is achieved by running virtual experiments (or simulations) to update Q-values using the system model. The system model is comprised of a state transition probability matrix \(T\) (see Sect. 2), which is estimated using historical data of each link’s successful and unsuccessful transmission rates based on the outgoing traffic of next-hop neighbor nodes.

5.3 Detecting the convergence of Q-values

When the Q-values have achieved convergence, further exploration may not change the Q-values, and so an exploitation action should be chosen. Hence, the detection of the convergence of Q-values helps to enhance network-wide performance.

Forster and Murphy (2007) propose two techniques to detect the convergence of Q-values for each route, with the objective of enhancing the learning process so that the knowledge is sufficiently comprehensive. Specifically, the first technique ensures the convergence of each route at the unit level, while the second technique enhances the convergence of \(N\) different routes at the system level. The first technique assumes that the Q-value of a route has achieved convergence when the route receives \(M\) static (or unchanged) rewards. The second technique requires that at least \(N\) routes have been explored, which ensures that convergence is achieved at the system level. Combining both techniques, convergence is achieved when there are \(N\) routes that have each received \(M\) static rewards.
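
A sketch of the two convergence tests under assumed data structures: a route is treated as converged after \(M\) consecutive unchanged rewards, and the system as converged once \(N\) routes have converged.

```python
def route_converged(recent_rewards, M=5):
    """Unit-level test: the last M rewards of a route are unchanged."""
    return len(recent_rewards) >= M and len(set(recent_rewards[-M:])) == 1

def system_converged(reward_history_per_route, N=3, M=5):
    """System-level test: at least N routes have each received M static rewards."""
    converged = sum(1 for rewards in reward_history_per_route.values()
                    if route_converged(rewards, M))
    return converged >= N

# Example: reward histories keyed by next-hop node (illustrative values).
history = {2: [4, 4, 4, 4, 4], 3: [3, 3, 3, 3, 3], 5: [2, 2, 2, 2, 2]}
print(system_converged(history, N=3, M=5))   # -> True
```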

5.4 Storing Q-values efficiently

When the number of states (e.g. destination nodes) and actions increases, the memory required to store the Q-values of all \(|S|\times |A|\) state-action pairs may become prohibitive. Storing the Q-values efficiently reduces this memory requirement.

Chetret et al. (2004) adopt an approach called the Cerebellar Model Articulation Controller (CMAC) (Albus 1975), a type of neural network, to store the Q-values, which represent the end-to-end delay of routes. The advantage of CMAC is that it uses a constant amount of memory to store the Q-values. CMAC stores smaller values with higher accuracy compared to larger values. This can be helpful for routing schemes that aim to achieve low end-to-end delay because low values of end-to-end delay are stored (Chetret et al. 2004). CMAC computes a function over its inputs, which may be multi-dimensional, in order to store the values. Each value is represented by a set of points in an input space, which is a hypercube. Each of these points represents a memory cell in which data is stored. Hence, a value is partitioned and stored across multiple memory cells. In order to retrieve a value, the corresponding points (or memory cells) of the hypercube must be activated to retrieve all portions of the original value; the combined portions form the stored Q-value. Furthermore, the Least-Mean Square (LMS) algorithm (Yin and Krishnamurthy 2005) is adopted to update the weights, which are the learning parameters in CMAC. LMS is a stochastic gradient descent approach (Snyman 2005) that minimizes the errors in representing the Q-values in each dimension of the hypercube.
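
A minimal one-dimensional CMAC-style sketch with an LMS weight update is given below for illustration; it is a simplification under assumed parameters, not the implementation of Chetret et al. (2004).

```python
class SimpleCMAC:
    """1-D CMAC: a value is spread across one weight (memory cell) per tiling;
    reading sums the activated cells, and LMS shares the error among them."""

    def __init__(self, num_tilings=8, tile_width=1.0, num_tiles=64):
        self.num_tilings = num_tilings
        self.tile_width = tile_width
        # One weight table per tiling, each tiling shifted by a fraction of a tile.
        self.weights = [[0.0] * num_tiles for _ in range(num_tilings)]

    def _active_cells(self, x):
        for t in range(self.num_tilings):
            offset = t * self.tile_width / self.num_tilings
            yield t, int((x + offset) / self.tile_width) % len(self.weights[t])

    def read(self, x):
        return sum(self.weights[t][c] for t, c in self._active_cells(x))

    def update(self, x, target, lms_rate=0.1):
        error = target - self.read(x)
        for t, c in self._active_cells(x):
            # LMS: each activated cell absorbs an equal share of the error.
            self.weights[t][c] += lms_rate * error / self.num_tilings

# Example: store an estimated end-to-end delay for input x (illustrative values).
cmac = SimpleCMAC()
cmac.update(x=3.7, target=12.5)
print(round(cmac.read(3.7), 3))
```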

5.5 Application of rules

Rules can be incorporated into the traditional Q-learning approach in order to fulfill a routing scheme’s network requirements, such as minimum end-to-end delay and number of hops to the destination node (Yau et al. 2012). Rules can be applied to exclude actions if their respective Q-values are higher (or lower) than a certain threshold.

Yu et al. (2008) calculate the ratio of the number of times an action \(a_t^i \) violates a rule to the total number of times the action \(a_t^i\) is executed. When the ratio exceeds a certain threshold, the action \(a_t^i\) is excluded from future action selection. In Liang et al. (2008) and Lin and Schaar (2011), a node reads the QoS requirements, particularly end-to-end delay, encapsulated in the data packets. If the estimated time to be incurred along the remaining route to the destination node does not fulfill the end-to-end delay requirement, the node does not forward the packet to a next-hop node.

Further research could be pursued to investigate the application of rules to address other issues in routing. For instance, rules may be applied to detect malicious nodes, which may advertise manipulated Q-values with the intention of adversely affecting the routing decisions of other nodes.

5.6 Approximation of the initial Q-values

The traditional RL approach initializes Q-values with random values, which may not represent the real estimates, and so non-optimal actions are taken during the initial stage. Subsequently, learning takes place to update these values until the Q-values have achieved convergence. The initial random values may reduce the convergence rate and may cause fluctuations in network performance, especially at the beginning of the learning process. Consequently, it is preferable to initialize Q-values with approximate estimates rather than random values.

In Forster and Murphy (2007), Q-values are initialized based on the number of hops to each destination sink node in WSNs (see Sect. 3.1.2). Specifically, the sink node broadcasts request packets, and each sensor node initializes its Q-values as the sink node’s request packets pass through it; hence each sensor node has an estimate of the number of hops to the sink node. Subsequently, learning updates these values using the received rewards.

6 Application of reinforcement learning to routing in distributed wireless networks

This section presents how various routing schemes have been approached using RL to provide network performance enhancement. We describe the main purpose of each scheme, and its RL model.

6.1 Q-routing model

This section discusses the routing schemes that adopt the Q-routing model (see Sect. 4.1).

6.1.1 Q-routing approach with forward and backward exploration

Dual RL-based Q-routing (Kumar and Miikkulainen 1997), which is an extension of the traditional Q-routing model, enhances network performance and convergence speed (see Sect. 5.2). Xia et al. (2009) also apply a dual RL-based Q-routing approach in CRNs (see Sect. 3.1.3), and it has been shown to reduce end-to-end delay. In CRNs, the availability of a channel is dynamic, and it depends on the PU activity level. The purpose of the routing scheme is to enable a node to select a next-hop neighbor node with a higher number of available channels. A higher number of available channels reduces channel contention, and hence reduces the MAC layer delay.

Table 4 shows the Q-routing model for the routing scheme at node \(i\). Note that, the state and action representations are not shown, and they are similar to the general RL model in Table 3. The state \(s_t^i \) represents a destination node \(n\). The action \(a_t^i \) represents the selection of a next-hop neighbor node \(j\). The reward \(r_t^i ( {s_t^i , a_t^i } )\) represents the number of available common channels between node \(i\) and node \(a_t^i =j\). The Q-routing model is embedded in each SU node.

Table 4 Q-routing model for the routing scheme at node \(i\) (Xia et al. 2009)

Node \(i\)’s Q-value indicates the total number of available channels, summed over the links along a route to destination node \(s_t^i \) through a next-hop neighbor node \(a_t^i =j\). Node \(i\) chooses the next-hop neighbor node \(a_t^i =j\) that has the maximum \(Q_t^i ( {s_t^i ,j} )\). Hence, Eq. (5) is rewritten as follows:

$$\begin{aligned} Q_{t+1}^i \left(s_t^i ,j\right)\leftarrow \left( {1-\alpha } \right)Q_t^i \left(s_t^i ,j\right)+\alpha \left[ {r_{t+1}^i \left(s_{t+1}^i ,j\right)+\mathop {\text{ max}}\limits _{k\in a_t^j } Q_t^j \left( {s_t^j ,k} \right)} \right] \end{aligned}$$
(6)

where \(k\) is the next-hop neighbor node of \( a_t^i =j\).

Traditionally, Q-routing performs forward exploration by updating node \(i\)’s Q-value \(Q_t^i (s_t^i ,j)\) whenever a feedback, specifically \(\text{ max}_{k\in a_t^j } Q_t^j ( {s_t^j ,k})\), is received from a next-hop neighbor node \(j\) for each packet sent to destination node \(n\) through node \(j\). Xia et al. (2009) extend Q-routing with backward exploration (see Sect. 5.2) (Kumar and Miikkulainen 1997), in which Q-values are updated for the previous and next states simultaneously. This means that the Q-values at node \(i\) and node \(j\) are updated for each packet sent from a source node \(s \in N\) to a destination node \(s_t^i \) passing through node \(i\) and node \(j\). Specifically, in addition to updating the Q-value of node \(i\) whenever it receives a feedback from node \(j\), node \(j\) also updates its Q-value whenever it receives forwarded packets from node \(i\). Note that node \(i\)’s Q-values are piggybacked onto the packets forwarded to neighbor node \(j\). Using this approach, node \(i\) has an updated Q-value to destination node \(s_t^i \) through neighbor node \(j\), and node \(j\) has an updated Q-value to source node \(s\) through neighbor node \(i\). Hence, nodes along a route have updated Q-values of the route in both directions. The dual RL-based Q-routing approach has been shown to minimize end-to-end delay.
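
The following sketch shows how forward and backward exploration could update both ends of a link following Eq. (6); the use of the number of common channels as the reward follows the description above, while the piggybacked/feedback fields and names are illustrative assumptions.

```python
from collections import defaultdict

def forward_update(Q_i, dest, j, common_channels, feedback_max_q_j, alpha=0.5):
    """Forward exploration at node i (Eq. 6): the feedback from next hop j
    refreshes node i's estimate of available channels towards the destination."""
    sample = common_channels + feedback_max_q_j
    Q_i[(dest, j)] = (1 - alpha) * Q_i[(dest, j)] + alpha * sample

def backward_update(Q_j, src, i, common_channels, piggybacked_max_q_i, alpha=0.5):
    """Backward exploration at node j: the Q-value piggybacked on the data
    packet from node i refreshes node j's estimate towards the source."""
    sample = common_channels + piggybacked_max_q_i
    Q_j[(src, i)] = (1 - alpha) * Q_j[(src, i)] + alpha * sample

# One packet sent over link i -> j updates both directions (illustrative values:
# 3 common channels on the link, plus each node's best onward estimate).
Q_i, Q_j = defaultdict(float), defaultdict(float)
forward_update(Q_i, dest="n", j="j", common_channels=3, feedback_max_q_j=7.0)
backward_update(Q_j, src="s", i="i", common_channels=3, piggybacked_max_q_i=5.0)
```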

6.1.2 Q-routing approach with dynamic discount factor

The traditional RL approach has a static discount factor \(\gamma \), which indicates the preference for future long-term rewards. The enhanced Q-routing approach with a dynamic discount factor calculates the discount factor for each next-hop neighbor node, hence each neighbor may have a different discount factor value. This may provide a more accurate estimation of the Q-values of different next-hop neighbor nodes, which may have different characteristics and capabilities.

Santhi et al. (2011) propose a Q-routing approach with dynamic discount factor to reduce the frequency of triggering the route discovery process due to link breakage in MANETs. The proposed routing scheme aims to establish routes with high robustness, which are less likely to fail, and it has been shown to reduce end-to-end delay and increase packet delivery rate. This is achieved by selecting a reliable next-hop neighbor node based on three factors, which are considered in the estimation of discount factor \(\gamma \), namely link stability, bandwidth efficiency, and node’s residual energy. Note that, the link stability is dependent on the node mobility.

Table 5 shows the Q-routing model for the routing scheme at node \(i\). Note that, the state and action representations are not shown, and they are similar to the general RL model in Table 3. The state \(s_t^i \) represents a destination node \(n\). The action \(a_t^i \) represents the selection of a next-hop neighbor node \(j\). The reward \(r_t^i ( {a_t^i } )\) indicates whether node \(i\)’s packet has been successfully delivered to destination node \(n\) through node \(j\). The Q-routing model is embedded in each mobile node.

Table 5 Q-routing model for the routing scheme at node \(i\) (Santhi et al. 2011)

Node \(i\)’s Q-value, which indicates the possibility of a successful packet delivery to its destination node \(n\) through a next-hop neighbor node \(a_t^i =j\), is updated at time \(t+1\) as follows:

$$\begin{aligned} Q_{t+1}^i \left(s_t^i ,j\right)\leftarrow \left( {1-\alpha } \right)Q_t^i \left(s_t^i ,j\right)+\alpha \left[ {r_{t+1}^i \left(s_{t+1}^i ,j\right)+\gamma _{i,j} \mathop {\text{ max}}\limits _{k\in a_t^j } Q_t^j \left( {s_t^j ,k} \right)} \right] \end{aligned}$$
(7)

where \(k\) is the next-hop neighbor node of \( a_t^i =j\). The uniqueness of this approach is that the discount factor \( 0 \le \gamma _{i, j} \le 1\) is a variable, and so the Q-values are discounted according to three factors that affect the discount factor. These factors are estimated, piggybacked onto Hello messages, and exchanged periodically among the neighbor nodes. Specifically, when node \(i\) receives a Hello message from its neighbor node \(j\), it calculates its discount factor for node \(j\), \(\gamma _{i, j}\), as follows:

$$\begin{aligned} \gamma _{i, j} = \omega \sqrt{MF_j \cdot BF_j \cdot PF_j },\quad j \in J \end{aligned}$$
(8)

where \(\omega \) is a pre-defined constant; \(MF_j \) represents the mobility factor, which indicates the estimated link lifetime between node \(i\) and node \(j\); \(BF_j \) represents the available bandwidth at node \(j\); and \(PF_j \) represents the residual energy of node \(j\). The Q-routing model chooses the next-hop neighbor node \(a_t^i =j\) with the maximum \(Q_t^i ( {s_t^i ,j} )\) value. Based on Eq. (7), a higher value of \(\gamma _{i, j} \) generates a higher Q-value, which makes the action \(a_t^i =j\) more likely to be selected.
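
A sketch of the per-neighbor discount factor of Eq. (8) and its use in the update of Eq. (7); the normalization of the three factors to [0, 1] is an assumption, not taken from the source, and the names are illustrative.

```python
import math
from collections import defaultdict

def dynamic_gamma(mobility_factor, bandwidth_factor, energy_factor, omega=0.9):
    """Eq. (8): per-neighbor discount factor derived from Hello-message fields."""
    return omega * math.sqrt(mobility_factor * bandwidth_factor * energy_factor)

def update(Q, dest, j, reward, neighbor_max_q, gamma_ij, alpha=0.5):
    """Eq. (7): Q-update with a per-neighbor discount factor gamma_ij."""
    sample = reward + gamma_ij * neighbor_max_q
    Q[(dest, j)] = (1 - alpha) * Q[(dest, j)] + alpha * sample

# Example: factors assumed normalized to [0, 1].
Q = defaultdict(float)
gamma_ij = dynamic_gamma(mobility_factor=0.8, bandwidth_factor=0.6,
                         energy_factor=0.9)
update(Q, dest="n", j="j", reward=1.0, neighbor_max_q=2.0, gamma_ij=gamma_ij)
```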

6.1.3 Q-routing approach with learning rate adjustment

Bhorkar et al. (2012) propose a Q-routing approach with learning rate adjustment to minimize the average per-packet routing cost in wireless ad hoc networks (see Sect. 3.1.1). The learning rate is adjusted using a counter that keeps track of a node’s observation (e.g. number of packets received from a neighbor node) from the operating environment.

Table 6 shows the Q-routing model used to select a next-hop node for node \(i\). The state \(s_t^i \) represents the set of node \(i\)’s next-hop neighbor nodes that have successfully received a packet from node \(i\). The action \(a_{rtx}^i \) represents retransmission of a packet by node \(i\); \(a_f^i \) represents forwarding a packet to neighbor node \(j \in s_t^i \); and \(a_d^i \) represents termination/dropping of a packet, which can be performed either by the packet’s destination node or by an intermediate node. The reward takes a positive value only if the packet has reached its destination node. The Q-routing model is embedded in each mobile node.

Table 6 Q-routing model for the routing scheme at node \(i\) (Bhorkar et al. 2012)

Node \(i\)’s Q-value, which indicates the appropriateness of transmitting a packet from node \(i\) to node \(a_t^i =j\), is updated at time \(t+1\) as follows:

$$\begin{aligned} Q_{t+1}^i \left( {s_t^i , j} \right)\leftarrow \left( {1-\alpha _{ c_t^i \left( {s_t^i , j} \right)} } \right)Q_t^i \left( {s_t^i , j} \right) + \alpha _{ c_t^i \left( {s_t^i , j} \right)} \left[ {r_{t+1}^i \left( {s_{t+1}^i ,j} \right)+ \mathop {\text{ max}}\limits _{k\in a_t^j } Q_t^j \left( {s_t^j ,k} \right)} \right] \end{aligned}$$
(9)

where \(c_t^i ( {s_t^i , a_t^i })\) is a counter that keeps track of the number of times the set of nodes \(s_t^i \) has received a packet from node \(i\) using action \(a_t^i \) up to time \(t\), and \(k\) is the next-hop neighbor node of node \(j\), \(k\in A(s_t^j )\). The learning rate \(\alpha _{ c_t^i ( {s_t^i , j} )} \) is adjusted based on the counter \(c_t^i ( {s_t^i , a_t^i } )\), and so it depends on the exploration of state-action pairs. Hence, a higher (lower) value of \(\alpha _{ c_t^i ( {s_t^i , j} )} \) indicates a higher (lower) convergence rate at the expense of greater (lesser) fluctuations of the Q-values. Further research could be pursued to investigate the optimal value of \(\alpha _{ c_t^i ( {s_t^i , j} )} \) and the effects of \(c_t^i ( {s_t^i , j} )\) on \(\alpha _{ c_t^i ( {s_t^i , j} )} \).
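
A sketch of the update of Eq. (9) with a counter-dependent learning rate; the mapping \(\alpha _c = 1/(c+1)\) is an illustrative assumption, since the exact dependence of the learning rate on the counter is left open above.

```python
from collections import defaultdict

counts = defaultdict(int)     # c_t^i(s, a): visits to each state-action pair
Q = defaultdict(float)

def adaptive_update(state, action, reward, neighbor_max_q):
    """Eq. (9) with a counter-dependent learning rate.

    alpha = 1 / (count + 1) is an illustrative assumption; any decreasing
    sequence over the visit counter could be substituted."""
    counts[(state, action)] += 1
    alpha = 1.0 / (counts[(state, action)] + 1)
    sample = reward + neighbor_max_q
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * sample

adaptive_update(state="S", action="j", reward=0.0, neighbor_max_q=1.0)
```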

6.1.4 Q-routing approach with Q-values equivalent to rewards

Baruah and Urgaonkar (2004) use rewards to reduce route cost and energy consumption in WSNs with mobile sink nodes. A similar Q-routing approach has also been applied in Forster and Murphy (2007), and the discussion in this section is based on Baruah and Urgaonkar (2004). Moles, which are nodes located within a single hop of a sink node, use RL to learn the sink node’s movement pattern, and send their received packets to the sink node. Each mole characterizes a sink node’s movement pattern using a goodness value \(G\). Specifically, the goodness value is a probability that indicates the presence or absence of a sink node within a mole’s one-hop vicinity. The purpose of the routing scheme is to enable a sensor node to select a next-hop node with a higher likelihood of reaching a mole, which subsequently sends packets to the sink node.

There are two main types of Q-routing models. Table 7 shows the Q-routing model for the routing scheme at node \(i\) using Multiplicative Increase Multiplicative Decrease (MIMD). Note that, the state and action representations are not shown, and they are similar to the general RL model in Table 3. Also note that, this routing scheme applies \(Q_{t+1}^i ( {s_t^i , a_t^i } )=r_t^i ( {s_t^i , a_t^i } )\). In other words, action selection is based on reward \(r_t^i ( {s_t^i , a_t^i } )\). The state \(s_t^i \) represents a sink node \(n\). The action \(a_t^i \) represents the selection of a next-hop neighbor node \(j\). The rewards \(r_t^i ( {s_t^i , a_t^i })\) represent the likelihood of reaching the sink node \(s_t^i =n\) through node \(a_t^i =j\). Note that, the reward is chosen based on the goodness value \(G\), where \(G_{TH,H} \) and \(G_{TH,L} \) are thresholds for positive and negative reinforcements, respectively. Higher goodness value \(G\) indicates a higher probability of the presence of a sink node within a mole’s one-hop vicinity. The Q-routing model is embedded in each sensor node. In MIMD, the reward value \(r_t^i ( {s_t^i , a_t^i } )\) is doubled when \(G_{TH,H} <G\le 1\), and it is halved when \(0\le G<G_{TH,L} \).

Table 7 Q-routing model for the routing scheme at node \(i\) using MIMD (Baruah and Urgaonkar 2004)

Table 8 shows the Q-routing model for the routing scheme at node \(i\) using Distance Biased Multiplicative Update (DBMU). In DBMU, the reward value \(r_t^i ( {s_t^i , a_t^i } )\) is increased by a factor of \(\sqrt{(d+3)}\) when \(G_{TH,H} <G\le 1\), and it is decreased by the same factor \(\sqrt{(d+3)}\) when \(0\le G<G_{TH,L} \), where \(d\) indicates the number of hops from the source sensor node. Hence, sensor nodes further away from the source sensor node receive higher rewards; in other words, sensor nodes nearer to the mole (or closer to the sink node) are better indicators of the goodness value.

Table 8 Q-routing model for the DBMU-based routing scheme at node \(i\) (Baruah and Urgaonkar 2004)
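
The MIMD and DBMU reward adjustments described above can be sketched as follows; the thresholds, hop count, and the behavior for intermediate goodness values are illustrative assumptions.

```python
import math

def mimd_reward(reward, goodness, g_high=0.7, g_low=0.3):
    """MIMD: double the reward on positive reinforcement, halve it on negative."""
    if goodness > g_high:
        return reward * 2.0
    if goodness < g_low:
        return reward / 2.0
    return reward                      # unchanged in between (assumption)

def dbmu_reward(reward, goodness, hops, g_high=0.7, g_low=0.3):
    """DBMU: scale the reward by sqrt(d + 3), where d is the hop count
    from the source sensor node."""
    factor = math.sqrt(hops + 3)
    if goodness > g_high:
        return reward * factor
    if goodness < g_low:
        return reward / factor
    return reward

print(mimd_reward(1.0, goodness=0.9), dbmu_reward(1.0, goodness=0.9, hops=6))
```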

6.1.5 Q-routing approach with average Q-values

Arroyo-Valles et al. (2007) propose a geographic routing scheme that applies RL to increase packet delivery rate and network lifetime in WSNs (see Sect. 3.1.2). A node chooses its next-hop neighbor node towards a sink node by taking into account the expected number of retransmissions throughout a route.

Table 9 shows the Q-routing model for the routing scheme at node \(i\). There is no state representation. The action \(a_t^i \) represents the selection of a next-hop neighbor node \(j\), which is physically located nearer to the sink node. The reward \(r_t^i ( {s_t^i , a_t^i } )\) represents the estimated number of retransmissions for a single-hop transmission.

Table 9 Q-routing model for the routing scheme at node \(i\) (Arroyo-Valles et al. 2007)

Node \(i\)’s Q-value indicates the total number of retransmissions along a route to the sink node through a next-hop neighbor node \(a_t^i =j\). The routing scheme of node \(i\) chooses a next-hop neighbor node \(a_t^i =j\) that has the minimum \(Q_t^i (j)\) value. In Arroyo-Valles et al. (2007), Eq. (5) is rewritten as follows:

$$\begin{aligned} Q_{t+1}^i \left( j \right)\leftarrow \left( {1-\alpha } \right)Q_t^i \left( j \right)+\alpha \left[ {r_{t+1}^i \left( j \right)+\bar{Q}_{t}^{j}\left( k \right)} \right] \end{aligned}$$
(10)

where \(k\) is the next-hop neighbor of node \(j\), and the average Q-value \(\bar{Q}_{t}^{j}(k)\) is as follows:

$$\begin{aligned} \bar{Q}_{t}^{j}\left( k \right)=\frac{\sum _{k\in K}P_{k}^{j}Q_{t}^{j}\left( k \right)}{\sum _{k\in K}P_{k}^{j}} \end{aligned}$$
(11)

where \(K\) is a set of node \(j\)’s neighbor nodes located nearer to the sink node, and \(P_k^j \) is the probability that node \(j\) forwards the packet to node \(k\); this probability depends on local information, such as transmission and reception energies, message priority and the neighbor node’s profile (i.e. its tendency to forward packets to the next-hop node).
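
A minimal Python sketch of Eqs. (10) and (11) is given below. The learning rate value is illustrative, and the dictionaries holding node \(j\)’s Q-values and forwarding probabilities are assumed to be obtained through message exchange; since the Q-values count expected retransmissions, node \(i\) would choose the next-hop neighbor node with the minimum resulting Q-value.

```python
def average_downstream_q(Q_j, P_j, K):
    """Eq. (11): forwarding-probability-weighted mean of node j's Q-values over the
    set K of node j's neighbors located nearer to the sink node."""
    numerator = sum(P_j[k] * Q_j[k] for k in K)
    denominator = sum(P_j[k] for k in K)
    return numerator / denominator if denominator > 0 else 0.0

def update_q(Q_i_j, reward, Q_j, P_j, K, alpha=0.5):
    """Eq. (10): node i's estimate of the total retransmissions via next hop j."""
    return (1 - alpha) * Q_i_j + alpha * (reward + average_downstream_q(Q_j, P_j, K))
```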

6.1.6 Q-routing approach based on on-policy Monte Carlo

The traditional Q-routing approach updates Q-values upon receiving a reward for each action taken, and so the Q-values may fluctuate. In the On-Policy Monte Carlo (ONMC) approach, the Q-values are updated on an episode-by-episode basis in order to reduce fluctuation. An episode refers to a window of timeslots in which a number of actions can be taken.

Naruephiphat and Usaha (2008) propose an energy-efficient routing scheme using ONMC-based Q-routing in order to reduce energy consumption and to increase network lifetime in MANETs. The routing scheme aims to achieve a balanced selection of two types of routes: (1) routes that incur lower energy consumption; and (2) routes that are comprised of nodes with higher residual energy.

Table 10 shows the ONMC-based Q-routing model for the routing scheme at node \(i\). The state \(s_t^i \) represents a two-tuple of information, namely the energy consumption level of an entire route and the least residual energy level of a node within the route. An action \(a_t^i \) represents the selection of a route. The reward \(r_t^i ( {s_t^i ,a_t^i } )\) represents the estimation of the cost (or energy consumption) of a route.

Table 10 ONMC-based Q-routing model for the routing scheme at node \(i\) (Naruephiphat and Usaha 2008)

At the end of each episode, the average cost of actions (or route selections) taken within the episode of a window duration of \(T_e \) is calculated as follows:

$$\begin{aligned} r_{avg,t}^i \left( {s_t^i ,a_t^i } \right)=\frac{\mathop \sum \nolimits _{T_e } r_t^i \left( {s_t^i ,a_t^i } \right)}{T_e } \end{aligned}$$
(12)
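
As a simple illustration of the episodic update in Eq. (12), the Python sketch below represents an episode as the list of per-action costs collected during the \(T_e\) timeslots; this representation is an assumption made purely for demonstration.

```python
def episode_average_cost(episode_rewards):
    """Eq. (12): average cost of the route selections taken within one episode,
    where episode_rewards holds one cost value per timeslot of the episode (T_e values)."""
    T_e = len(episode_rewards)
    return sum(episode_rewards) / T_e if T_e > 0 else 0.0

# Example usage: costs observed over an episode of T_e = 4 timeslots.
print(episode_average_cost([2.0, 3.5, 1.0, 2.5]))   # 2.25
```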

6.1.7 Q-routing approach based on model construction and update

Hu and Fei (2010) use a model-based Q-routing approach, which is based on the MDP model (see Sect. 2), to provide a higher convergence rate (see Sect. 5.2) in order to reduce route cost and energy consumption, and it is applied in underwater WSNs (see Sect. 3.1.2). The purpose of the routing scheme is to enable a sensor node to select a next-hop neighbor node with higher residual energy, which subsequently sends packets towards the sink node. Table 11 shows the Q-routing model for the routing scheme, and it shall be noted that the model is a representation for a particular packet. The state \(s_t^i \) represents the node in which a particular packet resides (or node \(i\)). The action \(a_t^i \) represents the selection of a next-hop neighbor node \(j\). The reward \(r_t^i ( {s_t^i , a_t^i })\), which is a dynamic value, represents various types of energy, including transmission energy and residual energy, incurred for forwarding the packet to node \(a_t^i =j\). Taking into account the residual energy helps to avoid highly utilized routes (or hot spots) in order to achieve a balanced energy distribution among routes. The Q-routing model is embedded in each packet.

Table 11 Q-routing model for the routing scheme for a packet at node \(i\) (Hu and Fei 2010)

Node \(i\)’s Q-function, which indicates the appropriateness of transmitting a packet from node \(i\) to node \(a_t^i =j\), is updated at time \(t+1\) as follows:

$$\begin{aligned} Q_{t+1}^i \left( {s_t^i , j} \right)= r \left( {s_t^i , a_t^i } \right) + \gamma \left[ {P_{s_t^i s_t^i }^{a_t^i } \mathop {\text{ max}}\limits _{k\in a_t^i } Q_t^i \left( {s_t^i ,k} \right)+P_{s_t^i s_t^j }^{a_t^i } \mathop {\text{ max}}\limits _{k\in a_t^j } Q_t^j \left( {s_t^j ,k} \right)} \right] \end{aligned}$$
(13)

where \(P_{s_t^i s_t^i }^{a_t^i } \) is the transition probability of an unsuccessful transmission from \(s_t^i \) (or node \(i\)) after taking action \(a_t^i \), while \(P_{s_t^i s_t^j }^{a_t^i } \) is the transition probability of a successful transmission from \(s_t^i \) to \(s_t^j \) (or node \(j\)) after taking action \(a_t^i \). Further explanation about the transition probability models used to estimate \(P_{s_t^i s_t^i }^{a_t^i } \) and \(P_{s_t^i s_t^j }^{a_t^i } \) is given in Sect. 5.2.
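
The model-based backup in Eq. (13) can be read as an expected value over the two possible outcomes of a transmission, as in the following Python sketch; the candidate Q-value lists are assumed to be non-empty and locally available (e.g. through message exchange), and the discount value is illustrative.

```python
def model_based_q(reward, p_fail, p_success, q_candidates_at_i, q_candidates_at_j,
                  gamma=0.9):
    """Eq. (13): expected-value backup for forwarding a packet from node i to node j.
    With probability p_fail the packet stays at node i (unsuccessful transmission);
    with probability p_success it moves to node j (successful transmission)."""
    stay_value = p_fail * max(q_candidates_at_i)      # best follow-up action at node i
    move_value = p_success * max(q_candidates_at_j)   # best follow-up action at node j
    return reward + gamma * (stay_value + move_value)
```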

6.1.8 Q-routing approach with actor-critic method

Lin and Schaar (2011) reduce message exchange in the traditional Q-routing approach without considerably jeopardizing the convergence rate in a joint routing and power control scheme for delay-sensitive applications in wireless ad hoc networks (see Sect. 3.1.1). The purpose of the routing scheme is to enable a node to select a next-hop neighbor node that provides a higher successful transmission rate.

In Lin and Schaar (2011), nodes are organized in hops such that any node is located \(h\) hops from a base station. Note that, the base station is located at the \(H\)th hop. Table 12 shows the Q-routing model for the routing scheme for \(h\)th hop nodes. The state \(s_t^h \) represents a two-tuple of information, namely the channel state and queue size of \(h\)th hop nodes. Note that, \(s_t^h \) is locally available to all \(h\)th hop nodes. The investigation is limited to a single destination node in Lin and Schaar (2011), and so the destination node is not represented in the state. The action \(a_t^h \) represents two types of actions: \(a_{n,t}^h \) represents the selection of an \(h\)th hop node, and \(a_{p,t}^h \) represents a transmission power. The reward \(r( {s_t , a_t } )\) represents the number of received packets. Note that, the reward is only a function of the state and action at the \(H\)th hop destination node; however, it represents successful receptions of packets throughout the entire network. Hence, \(r\left( {s_t , a_t } \right)\) also indicates the successful transmission rate of all links in the network.

Table 12 Q-routing model for the routing scheme for \(h\)th hop nodes (Lin and Schaar 2011)

Instead of using the traditional Q-routing-based Q-function (5), Lin and Schaar (2011) apply an actor-critic approach (Sutton and Barto 1998) in which the value function \( V_t^h ( {s_t^h } )\) (or critic) and policy \(\rho _t^h ( {s_t^h , a_t^h } )\) (or actor) updates are separated. The critic evaluates the chosen actions and its feedback is used to strengthen or weaken the tendency of choosing them, while the actor represents the tendency of choosing each state-action pair (Sutton and Barto 1998). Denoting the difference in the value function between the \(h\)th hop (or the current hop) and the \(h+1\)th hop (or the next hop) by \(\delta _t^h ({s_t^h })=V_t^{h+1} ( {s_t^{h+1} })-V_t^h ({s_t^h })\), the value function at \(h\)th hop nodes is updated using \(V_{t+1}^h ( {s_{t+1}^h } )=( {1-\alpha } )V_t^h ( {s_t^h } )+\alpha [\delta _t^h ( {s_t^h } )]\), and so it is updated using value functions received from \(h+1\)th hop nodes. Denoting \(\delta _t^H ( {s_t^H } )=r( {s_t , a_t })+\gamma V_t ( {s_t } )-V_t^H ( {s_t^H } )\), the value function at \(H\)th hop nodes (or the destination node) is \(V_{t+1}^H ( {s_{t+1}^H } )=( {1-\alpha } )V_t^H ( {s_t^H } )+\alpha [\delta _t^H ( {s_t^H } )]\), and so it is updated using the value function \(V_t ( {s_t } )\), which is received from \(1\)st hop nodes (or all the source nodes), and the reward \(r( {s_t , a_t } )\). Therefore, the reward \(r( {s_t , a_t } )\) is distributed throughout the network in the form of the value function \(V_t^h ( {s_t^h } )\). The routing scheme of an \(h\)th hop node chooses a next-hop neighbor node and transmission power \(a_t^h \) that provide the maximum \(\rho _t^h ( {s_t^h , a_t^h } )\) value, which is updated by \(\rho _{t+1}^h ( {s_{t+1}^h , a_{t+1}^h } )=\rho _t^h ( {s_t^h , a_t^h } )+\beta \delta _t^h ( {s_t^h } )\), where \(\beta \) is the learning rate of the actor. This routing scheme has been shown to increase the number of received packets at the destination node within a certain deadline (or goodput).
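
A minimal Python sketch of these intermediate-hop updates is shown below; the learning-rate values are illustrative, and the destination-node variant (which additionally folds in the reward and discount factor) is omitted for brevity.

```python
def critic_update(V_h, V_next_hop, alpha=0.1):
    """Critic at an h-th hop node: the temporal difference delta compares the value
    reported by the (h+1)-th hop with the local value, and drives the value update."""
    delta = V_next_hop - V_h
    V_h_new = (1 - alpha) * V_h + alpha * delta
    return V_h_new, delta

def actor_update(rho, delta, beta=0.05):
    """Actor: tendency rho(s, a) of choosing a (next hop, transmit power) pair,
    strengthened or weakened by the critic's temporal difference."""
    return rho + beta * delta
```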

Lin and Schaar (2011) also propose two approaches to reduce message exchange. Firstly, only a subset of nodes within an \(h\)th hop is involved in estimating the value function \(V_t^h \left( {s_t^h } \right)\). Secondly, message exchanges are only carried out after a certain number of time slots.

6.2 Multi-agent reinforcement learning (MARL) model

The Multi-Agent Reinforcement Learning (MARL) model decomposes a system-wide optimization problem into sets of local optimization problems, which are solved through collaboration without using global information. The collaboration may take the form of exchanging local information, including knowledge (i.e. Q-values), observations (or states) and decisions (or actions). For instance, MARL enables a node to collaborate with its neighbor nodes, and subsequently make local decisions independently in order to achieve network performance enhancement.

6.2.1 MARL approach with reward exchange

Elwhishi et al. (2010) propose a MARL-based routing scheme for delay tolerant networks (see Sect. 3.1.4), and it has been shown to increase packet delivery rate, as well as to decrease transmission delay. Routing schemes for delay tolerant networks are characterized by the lack of persistent end-to-end connectivity; each node explores network connectivity by finding a new link to a next-hop neighbor node when a new packet arrives, and the packet must be kept in the buffer until such a link is formed. The purpose of this routing scheme is to select a reliable next-hop neighbor node, and it takes into account three main factors: two factors that are relevant to channel availability (node mobility and congestion level), and one factor that is relevant to buffer utilization (the remaining space in the buffer).

Table 13 shows the MARL model for node \(i\) to select a reliable next-hop neighbor node. The state \(s_t^i \) includes three events \(\{B^{i}, U^{i,j},D^{j,i}\}\). The action \(a_t^i \) represents a transmission from node \(i\) to a next-hop neighbor node \(j\). The reward represents the amount of time in which a mobile node \(i\) and a node \(j\) are neighbor nodes. The MARL model is embedded in each node.

Table 13 MARL model for the routing scheme at node \(i\) (Elwhishi et al. 2010)

Node \(i\)’s Q-value, which indicates the appropriateness of transmitting a packet from node \(i\) to a node \(a_t^i =j\), is updated at time instant \(t+1\) as follows:

$$\begin{aligned} Q_{t+1}^i \left( {s_t^i ,a_t^i } \right)\leftarrow r_t^i \left( {a_t^i } \right)+T_{free} -\frac{P_{unsuccess}^{i, j} }{P_{success}^{i, j} }T_{busy} \end{aligned}$$
(14)

where \(T_{free} \) and \( T_{busy} \) are the amount of time a channel is free (or successful transmission) and busy (or unsuccessful transmission) during a time window interval \(T_{window} \), respectively; \(P_{success}^{i, j} =T_{free} /T_{window} \) and \(P_{unsuccess}^{i, j} =T_{busy} /T_{window} \) are the probability of successful and unsuccessful transmissions from node \(i\) to a next-hop neighbor node \(j\), respectively. The probability \(P_{unsuccess}^{i, j} \) covers two types of events that may cause unsuccessful transmissions: channel congestion and buffer overflow at the next-hop neighbor node \(j\). The MARL model chooses a next-hop neighbor node \(a_t^{i, j} \) that has the maximum \(Q_t^i ( {s_t^i ,a_t^i } )\) value.

Collaboration among the agents involves the exchange of rewards \(r_t^i ( {a_t^i } )\). The next-hop node is chosen from a selected set of nodes towards the destination node, which may be multiple hops away. For instance, agent \(i\) calculates \(r_t^i ( {a_t^{i, k} } )= r_t^i ( {a_t^{i, j} } )+ r_t^j ( {a_t^{j, k} } )\), which is the amount of time in which node \(i\) can communicate with two-hop neighbor node \(k\) through neighbor node \(j\). The value \(r_t^i ( {a_t^{i, k} } )\) is then exchanged with one-hop neighbor nodes.
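
The following Python sketch illustrates Eq. (14) and the two-hop reward aggregation; it is a simplified illustration in which the contact-time rewards advertised by neighbor nodes are assumed to be already available locally.

```python
def q_value(contact_reward, t_free, t_busy, t_window):
    """Eq. (14): appropriateness of forwarding from node i to node j, penalized by the
    ratio of unsuccessful to successful transmission probabilities over the window."""
    p_success = t_free / t_window
    p_unsuccess = t_busy / t_window
    if p_success == 0:
        return float('-inf')        # no successful transmission observed in the window
    return contact_reward + t_free - (p_unsuccess / p_success) * t_busy

def two_hop_reward(r_i_j, r_j_k):
    """Reward exchange: node i's estimate of the contact time towards a two-hop
    neighbor k through j is the sum of the one-hop rewards r^i(a^{i,j}) + r^j(a^{j,k})."""
    return r_i_j + r_j_k
```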

6.2.2 MARL approach with Q-value exchange

Liang et al. (2008) propose a MARL-based approach called Distributed Value Function-Distributed Reinforcement Learning (DVF-DRL) for routing in WSNs (see Sect. 3.1.2); and it has been shown to increase packet delivery rate, as well as to decrease end-to-end delay. The purpose of this routing scheme is to select a next-hop neighbor node that provides lower end-to-end delay; and it takes into account the Q-values, and hence the performance, of its neighboring nodes.

Table 14 shows the MARL model for the routing scheme at node \(i\) to select a next-hop neighbor node. The state \(s_t^i \) represents a two-tuple of information: \(S_n \), which represents the set of node \( i\)’s neighbor nodes, and \(S_p \), which represents the set of packets, encapsulated with QoS requirements, to be sent or forwarded. The action \(a_t^i \) represents either \(A_f \), which is a packet transmission from node \(i\) to a next-hop neighbor node \(j\); or \(A_d \), which is a packet drop if the end-to-end delay of a packet fails to fulfill the QoS requirement encapsulated in the packet itself. Denote the average one-hop link delay (i.e. queuing, transmission, processing and channel contention delays) between nodes \(i\) and \(j\) by \(T_{P_{ij} } \), and the average one-hop transmission delay by \(T_{P_{avr} } \). When a data packet transmission is successful, the reward value is \(r_t^i ( {s_t^i , a_t^i } )=T_{P_{avr} } /T_{P_{ij} } \); and when a data packet transmission is unsuccessful, the reward value is \(r_t^i ( {s_t^i , a_t^i } )=-1\). The MARL model is embedded in each sensor node.

Table 14 MARL model for the routing scheme at node \(i\) (Liang et al. 2008)

Node \(i\)’s Q-value, which indicates the appropriateness of transmitting a packet from node \(i\) to node \(a_t^i =j\), is updated at time instant \(t+1\) as follows:

$$\begin{aligned}&Q_{t+1}^i \left(s_t^i ,j\right)\leftarrow \left( {1-\alpha } \right)Q_t^i \left(s_t^i ,j\right) \nonumber \\&\quad +\alpha \left[ {r_{t+1}^i \left( {s_{t+1}^i } \right)\!+\!\gamma \omega \left( {i,j} \right)\mathop {\text{ max}}\limits _{k\in a_t^j } Q_t^j \left( {s_t^j ,k} \right)\!+\!\gamma \mathop \sum \limits _{j^{{\prime }}\in a_t^i ,j^{{\prime }}\ne j} \omega \left( {i,j^{{\prime }}} \right)\mathop {\text{ max}}\limits _{k\in a_t^{j^{{\prime }}} } Q_t^{j^{{\prime }}} \left( {s_t^{j^{{\prime }}} ,k} \right)} \right]\nonumber \\ \end{aligned}$$
(15)

where \(\omega \left( {i,j} \right)\) indicates node \(i\)’s weight on Q-values received from node \(j\); \(j^{{\prime }}\in a_t^i \) and \(j^{{\prime }}\ne j\) indicate node \(i\)’s neighbor nodes excluding its chosen next-hop neighbor node \(j\). Generally speaking, \(Q_{t+1}^i (s_t^i ,j)\) in Eq. (15) allows node \(i\) to keep track of its Q-value, the immediate reward, the maximum Q-value of its chosen next-hop neighbor node \(j\), and the maximum Q-values of all its neighbor nodes (except its chosen next-hop node \(j)\).
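
A compact Python sketch of the DVF-DRL update in Eq. (15) is given below. The dictionaries of weights and of the neighbors' best Q-values are assumed to be populated through Q-value exchange, and the learning-rate and discount values are illustrative.

```python
def dvf_drl_update(Q_i_j, reward, weights, best_neighbor_q, chosen_j,
                   alpha=0.5, gamma=0.9):
    """Eq. (15): node i's Q-value towards its chosen next hop j. The target mixes the
    immediate reward, the weighted best Q-value of j, and the weighted best Q-values
    of all other neighbors, where weights[n] = w(i, n) and
    best_neighbor_q[n] = max_k Q^n(s^n, k) reported by neighbor n."""
    others = sum(weights[n] * best_neighbor_q[n]
                 for n in best_neighbor_q if n != chosen_j)
    target = (reward
              + gamma * weights[chosen_j] * best_neighbor_q[chosen_j]
              + gamma * others)
    return (1 - alpha) * Q_i_j + alpha * target
```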

6.2.3 MARL approach with decay function

Dowling et al. (2005) propose a MARL-based routing scheme for MANETs, and it has been shown to increase network throughput and packet delivery rate, and to reduce the number of transmissions required for each packet. Routing in ad hoc networks can be challenging for two main reasons. Firstly, node mobility causes frequent changes in network topology. Secondly, imperfect underlying radio links cause congestion and deteriorate network performance. Using the MARL model, the purpose of this scheme is to enable the routing agent to adapt to the varying network conditions, and choose stable links in order to meet requirements on network performance. By sharing the current optimal policies among the neighboring nodes, the convergence rate to the optimal policies is expected to increase. A positive feedback contributes towards the convergence of the optimal policies, whereas a negative feedback results in a policy update at an agent due to congestion on some links, or the deterioration of a routing decision through some neighbor nodes.

Table 15 shows the MARL model to select a reliable next-hop node for node \(i\). The state \(s_t^i \) includes three kinds of events \(\{B^{i}, U^{i,j}, D^{j,i}\}\). The action \(a_t^i \) includes three kinds of actions \(\{a_f, a_D, a_B \}\). The reward represents link stability between node \(i\) and node \(j\), which is calculated based on the ratio between successful and unsuccessful transmissions. The MARL model is embedded in each mobile node.

Table 15 MARL model for the routing scheme at node \(i\) (Dowling et al. 2005)

Collaboration in MARL enables agents to exchange route cost advertisements. Note that, given state \(B^{i}\), a node \(i\) updates and calculates its route cost \(V_t^i \left( {B^{i}} \right)=\text{ max}_{a\in a_f^{i, j} } Q(B^{i},a)\), which is advertised and sent to its neighbor nodes. The route cost (or negative reward) is maximized and it is calculated as follows:

$$\begin{aligned} V_{t+1}^i \left( {B^{i}} \right)&= r_{tc}^i \left( {s_{t+1}^i ,a_{t+1}^i } \right) \nonumber \\&+\mathop {\max }\limits _{a\in a_f^{i, j} } \left[ {\text{ Decay}_{t+1}^i \left( {V_t^j \left( {U^{i,j}} \right)} \right)+r_s^i (a)+\frac{P_{unsuccess}^{i, j} }{P_{success}^{i, j} }r_f^i (a)} \right] \end{aligned}$$
(16)

where \(\text{ Decay}_{t+1}^i ( {V_t^j ( {U^{i,j}} )} )\) is a decay function at node \(i\) that deteriorates the routing cost \(V_t^j ( {U^{i,j}} )\) advertised by neighbor node \(j\) over time; \(r_s^i (a_f^{i, j} )\) and \(r_f^i (a_f^{i, j} )\) are the link reward and cost for successful and unsuccessful transmissions to neighbor node \(j\), respectively; \(P_{success}^{i, j} = P^{i}(U^{i,j} \mid B^{i}, a_f^{i, j} )\) and \(P_{unsuccess}^{i, j} =1- P^{i}(U^{i,j} \mid B^{i}, a_f^{i, j} )\) are the probability of successful and unsuccessful transmissions from node \(i\) to neighbor node \(j\), respectively, which are calculated using a statistical model that samples the transmission and reception events of a link.

Finally, as can be seen, using the feedback model, node \(i\) updates its routing behavior (i.e. its policy) whenever one of three events occurs. Firstly, a change in the link quality (i.e. \(P_{success}^{i, j} \) and \(P_{unsuccess}^{i, j} )\). Secondly, a change in the route cost \(( {V_t^j ( {U^{i,j}} )} )\) advertised by a neighbor node \(j\). Thirdly, if the routing cost \(V_t^j ( {U^{i,j}} )\) is not updated within a time window (i.e. in the absence of new advertisements from neighbor node \(j)\), then node \(i\) deteriorates \(V_t^j ( {U^{i,j}} )\) so that possible future performance degradation is taken into account as follows:

$$\begin{aligned} \text{ Decay}_{t+1}^i \left( {V_t^j \left( {U^{i,j}} \right)} \right)= V_t^j (U^{i,j}) \cdot p^{t_U } \end{aligned}$$
(17)

where \(t_U \) is the time since the last update; and \(p\) is the deteriorating factor.
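
The Python sketch below illustrates Eqs. (16) and (17); the deteriorating factor value of 0.95 is an assumption made for illustration, and the per-candidate link rewards and probabilities are assumed to be available from the statistical link model.

```python
def decayed_advertised_cost(V_j, t_since_update, p=0.95):
    """Eq. (17): deteriorate neighbor j's advertised route cost when no new
    advertisement has arrived for t_since_update timeslots (p: deteriorating factor)."""
    return V_j * (p ** t_since_update)

def route_cost(r_tc, candidates):
    """Eq. (16): r_tc is the common cost term, and candidates holds one tuple
    (decayed_V_j, r_success, r_fail, p_success, p_unsuccess) per forwarding action."""
    best = max(dV + r_s + (p_u / p_s) * r_f
               for (dV, r_s, r_f, p_s, p_u) in candidates)
    return r_tc + best
```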

The MARL-based routing scheme makes use of feedback while learning the optimal routing policy. The use of feedback increases the convergence rate compared to the traditional Q-routing approach, in which the policy is only updated whenever an agent executes its action.

6.3 Partial observable Markov decision process (POMDP) model

Nurmi (2007) proposes a POMDP-based routing scheme for ad hoc networks that estimates its state, which is comprised of its neighbor nodes’ local parameters (i.e. forwarding selfishness and energy consumption), and it has been shown to reduce energy consumption. Note that, to the best of our knowledge, this is the only routing scheme that applies the POMDP model.

Table 16 shows the POMDP model for the routing scheme at node \(i\). The state \(\theta _t^{i,j} ({H_t^{i,j} , \theta _{t-1}^{i,j} })\) represents node \(i\)’s estimates on neighbor node \(j\)’s forwarding selfishness and energy consumption parameters, which indicate node \(j\)’s capability to forward packets to destination. Note that, neighbor node \(j\in J^{n}\), where \(J^{n}\) is the set of one-hop neighbor nodes that have a valid route to a destination node \(n\). An action \(a_t^i \) represents the number of packets generated by node \(i\) to a neighbor node \(j\). The reward represents the gain or cost when node \(i\)’s packets are forwarded or dropped by next-hop neighbor node \(j\), respectively. The POMDP model is embedded in each node.

Table 16 POMDP model for the routing scheme at node \(i\) (Nurmi 2007)

In order to estimate the parameters of next-hop neighbor nodes, an algorithm that mainly consists of two parts is proposed. Firstly, it uses a stochastic approximation algorithm, namely Win-or-Learn Fast Policy Hill Climbing (WoLF-PHC), to estimate the local parameters \(\theta _t^{i,j} \) of its next-hop neighbor node \(j\in J^{n}\). WoLF-PHC uses a variable learning rate to adapt to the level of dynamicity in the operating environment (see Sect. 5.2). Secondly, a function approximation technique is proposed to learn a control function, which estimates the forwarding probability through neighbor node \(j\in J^{n}\). The forwarding probability function \(f_{t+1}^i \) is optimized using a stochastic gradient descent algorithm (Snyman 2005) as follows:

$$\begin{aligned} f_{t+1}^i = f_t^i +\eta \frac{\partial E}{\partial \theta _t^j } \end{aligned}$$
(18)

where \(\eta \) is the step size of the gradient; \(E\) is a loss function that indicates the error between the observed and the estimated probability.
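
As a simple illustration of the gradient step in Eq. (18), the following Python sketch assumes a squared-error loss between the observed and the estimated forwarding probability; the specific loss is not detailed above, so this form, as well as the step size, is an assumption.

```python
def forwarding_probability_step(f_t, observed, estimated, d_estimated_d_theta, eta=0.01):
    """Eq. (18): one stochastic-gradient update of the forwarding-probability function.
    Assuming a squared-error loss E = (observed - estimated)^2, the chain rule gives
    dE/dtheta = -2 * (observed - estimated) * d(estimated)/dtheta."""
    dE_dtheta = -2.0 * (observed - estimated) * d_estimated_d_theta
    return f_t + eta * dE_dtheta
```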

7 Implementation of routing using reinforcement learning in wireless platform

Forster et al. (2008) implement a RL-based routing protocol in mobile WSNs on a ScatterWeb platform comprised of MSB530 sensor nodes and ChipCon 1020 transceivers. The aim of this platform implementation is to evaluate the feasibility of implementing a RL-based routing protocol (Forster and Murphy 2007) in real hardware and to benchmark its performance against a traditional routing protocol, namely Directed Diffusion (Intanagonwiwat et al. 2003). The RL-based routing protocol aims to select a next-hop neighbor node with the lowest route cost (i.e. lower number of hops and packet retransmissions along a route) leading to multiple sink nodes (Forster and Murphy 2007). Additionally, a random backoff delay was introduced to reduce channel contention.

Compared to the traditional Directed Diffusion (Intanagonwiwat et al. 2003) routing protocol, the findings of the platform implementation are that the RL-based routing protocol:

  • Provides lower routing cost.

  • Provides higher packet delivery rate or lower packet loss rate attributed to the optimal routes.

  • Incurs higher memory requirements due to the complex data structures required to store the Q-values.

  • Causes higher route discovery delay.

8 Performance enhancements

Table 17 shows the performance enhancements brought about by the application of RL in various routing schemes. The RL approach has been shown to achieve the following performance enhancements.

  a. Lower end-to-end delay.

  b. Higher throughput.

  c. Higher packet delivery rate or lower packet loss rate.

  d. Lower routing overhead. Lower routing overhead may indicate more stable routes (or higher route robustness), and so it may indicate a lower number of packet retransmissions (Dowling et al. 2005). Usaha (2004) reduces the number of route discovery messages by incorporating a cache, which stores routing information, at each mobile node. This reduces the need to invoke the route discovery process for each connection request.

  e. Longer network lifetime. Longer network lifetime indicates lower energy consumption. For instance, Nurmi (2007) takes account of the residual energy of each node in routing in order to increase network lifetime.

  f. Higher reward value. Bhorkar et al. (2012) achieve a higher reward value, which indicates a lower average cost incurred by routing.

Table 17 Performance enhancements achieved by the RL-based routing schemes

9 Open issues

This section discusses open issues that can be pursued in this area. Additionally, further research could be carried out to investigate the new features discussed in Sect. 5.

9.1 Suitability and comparisons of action selection approaches

Action selection techniques, such as \(\varepsilon \)-greedy and softmax (see Sect. 2.2), have been widely applied in the literature. The \(\varepsilon \)-greedy approach performs exploration with a small probability \(\varepsilon \) (e.g. \(\varepsilon =0.1\)), and exploitation with probability \(1-\varepsilon \). On the other hand, the softmax approach ranks the exploration actions so that it does not explore actions that are far away from the optimal action. Nevertheless, each action selection technique has its merits and demerits with regard to the respective applications. Hence, further research could be pursued to investigate the suitability and network performance brought about by each of the techniques for routing in distributed wireless networks.
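
For reference, the two techniques can be sketched in Python as follows; the sketch assumes that a larger Q-value is better (for cost-based Q-routing, the selection would be inverted), and the temperature value is illustrative.

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore a random next hop with probability epsilon; otherwise exploit the
    next hop with the highest Q-value. q_values maps next-hop ids to Q-values."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def softmax(q_values, temperature=1.0):
    """Choose next hops with probability proportional to exp(Q / temperature), so
    actions far from the optimal one are rarely explored."""
    weights = {a: math.exp(q / temperature) for a, q in q_values.items()}
    threshold = random.random() * sum(weights.values())
    cumulative = 0.0
    for action, w in weights.items():
        cumulative += w
        if threshold <= cumulative:
            return action
    return action   # fallback for floating-point rounding
```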

9.2 Degradation of optimal route due to exploitation

Exploitation may deliver all data packets along the same optimal (or near-optimal) route, which may cause congestion, resulting in higher energy consumption, routing overhead, as well as the degradation of network performance. Various traditional approaches may be used to ameliorate this issue as described below:

  • Increasing learning rate. This increases the responsiveness of nodes to the dynamic conditions of the operating environment. When an optimal route becomes congested, nodes may change their respective routes. However, a new route may be suboptimal compared to the existing one. For instance, some intermediate nodes may have lower residual energy. Hence, achieving a better balance in energy consumption among various routes, including the optimal route, is necessary.

  • Increasing exploration rate. This reduces the traffic load on the optimal route. Nevertheless, this may affect the network performance because the traditional exploration techniques may choose the non-optimal routes, which may not meet the requirements on network performance.

  • Using different QoS classes. This enables nodes to achieve a balance among the available routes based on the QoS requirements. For instance, high-priority traffic may use an optimal route (i.e. exploitation), while lower-priority traffic may use a suboptimal route (i.e. exploration). As an example, Fu et al. (2005) adopt a genetic-based approach for exploration in route discovery, in which routes are discovered based on a predefined set of QoS goals (e.g. the throughput and end-to-end delay requirements). This approach has been shown to improve the network performance. Further research could be pursued to investigate the effectiveness and amount of overhead of this technique, the coordination among nodes for QoS provisioning along a route, as well as to ameliorate the selfishness of intermediate nodes.

9.3 Multi-agent RL approaches

Most routing schemes in the literature apply the traditional single-agent RL approach. These schemes have been shown to improve network performance. Nevertheless, the performance may be further enhanced using the multi-agent RL model (see Sect. 4.2) (Dowling et al. 2005; Elwhishi et al. 2010; Liang et al. 2008). The multi-agent RL model aims to coordinate agents in order to achieve the optimal network-wide performance, and its application in the literature has been limited. Nevertheless, the multi-agent RL approach requires additional overhead and complexity (Di Felice et al. 2010; Dowling et al. 2005), and so further research could be pursued to address these issues. In regards to the challenges of each type of distributed wireless network (see Table 1), some of the expected advantages brought about by coordination through the application of the multi-agent RL approach are as follows:

  • Mobile Ad Hoc Networks (see Sect. 3.1.1). The application of multi-agent RL approach may improve the coordination of nodes moving in groups, which may be an important factor in nodes’ routing decisions. Consequently, the routing decision of nodes in each group can be further enhanced and coordinated in real-time.

  • Wireless Sensor Networks (see Sect. 3.1.2). The application of the multi-agent RL approach may prolong network lifetime of a WSN. For instance, the neighboring sensor nodes may cooperate with each other to choose a sensor node with higher residual energy as clusterhead so that it is capable of collecting and forwarding data to the sink node.

  • Cognitive Radio Networks (see Sect. 3.1.3). The application of multi-agent RL approach may be helpful to solve many issues that require coordination among SUs, as well as PUs. For instance, the neighboring SUs can share their respective estimations of PU utilization level, which is location-dependent. Subsequently, they can coordinate their respective routing decisions in order to reduce the interference to PUs.

  • Delay Tolerant Networks (see Sect. 3.1.4). The application of multi-agent RL has been shown to improve the network-wide performance. For instance, Elwhishi et al. (2010) propose a MARL-based routing scheme that enables neighboring nodes to collaborate among themselves so that they can respectively select a reliable next-hop based on three factors, namely node mobility, the congestion level, and the buffer utilization level.

9.4 Exploration with stability enhancement

Exploration helps a routing scheme to converge to an optimal route selection policy. However, route discovery using non-optimal actions may increase network instability. Consequently, network performance (e.g. throughput and end-to-end delay) may fluctuate. While various approaches have been proposed to provide a balanced tradeoff between exploitation and exploration, further research could be pursued to investigate the possibility of achieving route exploration, while minimizing network instability. For instance, using rules, exploration may be triggered based on the required network performance level. Exploration of non-optimal routes may only be granted if network performance is unsatisfactory.

9.5 Application of events to routing

Another feature that can be incorporated into the traditional RL-based routing scheme is called event. Traditionally, the state monitors the conditions of the operating environment at all times. In contrast, an event, such as a node failure or a handoff, is detected occasionally whenever it occurs (Yau et al. 2012). In a highly dynamic operating environment, Q-values may change rapidly and fluctuate. An event may capture and detect the fluctuation of Q-values so that appropriate actions can be taken in order to improve the convergence rate and network performance.

Rules (see Sect. 5.5) and events may also be applied together. For instance, rules and events can improve the estimation of Q-values, and avoid invalid routes. For example, when mobile nodes move out of range, their respective Q-values may become invalid, and so making routing decisions based on these Q-values may degrade the network performance. An event may be applied to detect whether a node is out of range due to node movement; subsequently, rules may be applied to set the Q-values of such neighbor nodes to infinity (Yau et al. 2012; Chang et al. 2004).

9.6 Lack of implementation of RL-based routing schemes on wireless platforms

To the best of our knowledge, there is only a single implementation of a RL-based routing scheme on a wireless platform, by Forster et al. (2008) (see Sect. 7), and most of the existing routing schemes have been evaluated using simulation. Real implementation on a wireless platform is important to validate the correctness and feasibility of RL-based schemes. Further research can be pursued to investigate the implementation and challenges of RL-based routing schemes on wireless platforms.

10 Conclusions

Reinforcement Learning (RL) has been applied to various routing schemes for distributed wireless networks, including wireless ad hoc networks, wireless sensor networks, cognitive radio networks and delay tolerant networks, and it has been shown to improve network performance, such as higher throughput and lower end-to-end delay. RL enables a wireless node to observe its local operating environment, and subsequently learn to make global routing decisions efficiently. The advantages brought about by RL to routing are foreseen to draw significant research interest in the near future. This article has provided an extensive review of the existing RL-based routing schemes in distributed wireless networks. Firstly, this article presents Q-learning, which is a popular RL approach. Secondly, it identifies the challenges and advantages brought about by RL in various types of distributed wireless networks. A general RL model for routing is also presented. Thirdly, it presents three types of RL models for routing, namely Q-routing, multi-agent RL, and partially observable Markov decision process. Fourthly, it provides an extensive review of how various routing schemes have been formulated and modeled using RL in order to improve network performance. Fifthly, new features aiming to enhance the traditional RL approach for routing are presented. Sixthly, it presents an implementation of a RL-based routing scheme on a wireless platform. Seventhly, it presents performance enhancements achieved by the RL-based routing schemes. Lastly, it discusses the open issues associated with the application of RL in the routing schemes. Certainly, there is substantial room for future work in the application of RL to routing.