1 Introduction

A MANET is a wireless network that can be configured dynamically without any fixed infrastructure. Features such as dynamic topology [1], high node mobility [2, 3], low channel bandwidth for packet transmission and limited energy resources distinguish MANETs from other networks [4–6]. These features highlight the need for new methods and operations when designing routing protocols for MANETs. MANETs have many significant applications in different fields such as military operations, emergency search and other critical operations [3, 7]. Routing protocols in MANETs should adapt rapidly to frequent changes and an unpredictable topology [2] and should consume processing and communication resources optimally. MANETs face numerous issues that stem from their inherent design [8]. Achieving high throughput while discovering the best end-to-end path from the source to the destination node in a multi-hop network is of high significance [9]. Because the topology of a MANET is unstable and node movement is self-organized rather than centrally controlled, multi-objective routing [10, 11] is regarded as one of the main challenges in MANETs. Owing to node movement, the routes created among nodes are unstable; this instability not only increases packet delivery time but also wastes energy resources, and network partitions may occur [12, 13]. Indeed, it can be argued that the majority of challenges in MANETs are attributable to the dynamic topology [1] of these networks.

In this paper, we propose a method that reduces packet transmission delay by predicting the behavioral patterns of nodes. To the best of our knowledge, few studies have used reinforcement learning (RL) for routing in MANETs. Hence, this research gap is addressed in the present study, where a routing method for MANETs based on reinforcement learning and the Q-learning algorithm [14] is put forth. The proposed method relies only on the local data of neighboring nodes and makes no assumptions about the environment. Taking route stability and route length into account, we use trial-and-error reinforcement learning to choose the best neighbor for forwarding a packet toward the destination. In other words, reinforcement learning is used to predict the behavioral patterns of nodes with respect to the destination node. Indeed, the proposed method estimates the movement pattern of nodes indirectly using Q-learning, which is considered a robust algorithm in the area of reinforcement learning. When a packet reaches the destination, the method allocates a reward for the respective action; the values of actions are updated according to Q-learning. As the number of packet transmissions and receptions in the network grows and actions are selected repeatedly, by the law of large numbers the estimated values of actions approach their true values. Since the topology is dynamic and the purpose is to select the best action at any given time, the packet transmission delay should not be expected to converge to a fixed value. To strike a compromise between exploration and exploitation, the proposed method uses HELLO packets periodically; this makes it possible to estimate the values of actions that have not been selected so far. Furthermore, the proposed method uses the packet delivery success rate when selecting an action.

The rest of the paper is organized as follows: Sect. 2 briefly classifies and reviews the related work in this area. Section 3 describes the reinforcement learning scheme. Section 4 describes the proposed method, and Sect. 5 reports the implementation, comparison and simulation results. Finally, Sect. 6 concludes the study and recommends directions for further research.

2 Related works

Routing algorithms for MANETs are divided into seven groups based on their underlying architectural framework.

2.1 Table-driven or proactive routing protocols

Proactive protocols always maintain up-to-date routes from each node to every other node in the network. Routing information is stored in the routing table of each node. These protocols are not suitable for highly dynamic networks because of the excessive control overhead generated to keep the routing tables fresh at every node. The advantage of table-driven routing protocols is the short response time in determining a good route, thanks to the up-to-date network topology maintained at each node. This short response time, however, comes at the expense of consuming a large portion of network bandwidth for the non-productive control packets that maintain the network overview at each node [15]. Protocols such as DSDV [16] and OLSR [17] fall into this category. In ant colony-based routing algorithms such as Ant-Net-DSR, backward ants are transmitted to search for a path. As these backward ants pass through each node, they deposit positive pheromone in the node's routing table. During routing, nodes with more remaining pheromone are more likely to be selected [18]. In mobile agent routing, each information packet functions as an agent that collects network information. In this algorithm, each agent keeps a fixed-size history of the nodes it has visited. When an agent reaches the destination, the destination node examines the agent's history; if the route in the agent's history is better than the route in the destination node's routing table, the table is updated [19].

2.2 On demand or reactive routing protocols

Reactive routing protocols maintain only the routes that are currently required by the nodes. In this category, a route is created only when the source requests a route to a destination. The path is created through broadcasting an RREQ message and receiving an RREPLY. In a reactive routing protocol, a node does not need to periodically broadcast its routing table, which improves the use of network bandwidth. Protocols such as AODV [20], DSR [21] and TORA [22] fall into this category.

Dvir and Vasilakos [3] proposed an alternative, highly agile and dynamic backpressure routing scheme for DTNs, in which routing and forwarding decisions are made on a per-packet basis. Using information about queue backlogs, random walk and data packet scheduling, nodes can make packet routing and forwarding decisions without the notion of end-to-end routes. Simulation results show that this scheme has advantages in DTN settings. Zeng et al. [10] proposed a direction routing and scheduling scheme (DRSS) for green vehicular delay-tolerant networks using the Nash Q-learning technique. It optimizes energy efficiency and packet delivery ratio while accounting for congestion, buffer occupancy and delay. It uses a self-learning method and selects optimal routes within preferred areas, which helps route packets efficiently toward the destination. Dowling et al. [23] used collaborative learning to determine the traffic of neighboring channels and used the Boltzmann equation as the policy and criterion for selecting an action.

2.3 Hybrid routing protocols

Hybrid routing methods such as the zone routing protocol (ZRP) [24] combine elements of table-driven and on-demand routing protocols. By appropriately combining these two approaches, the system can achieve higher overall performance. In [25], the challenges and design issues for different routing metrics are discussed, along with a scheme to develop hybrid routing metrics by combining different metrics. Protocols like ZRP [24] reduce both the typical delays of reactive protocols and the communication overhead introduced by proactive protocols.

2.4 Location-aware (geographical) routing algorithms

Location-aware routing protocols such as [26] assume that the individual nodes are aware of the locations of all the nodes within the network. This location information is then utilized by the routing protocol to determine the optimum routes.

2.5 Multipath routing algorithms

Multipath routing protocols such as [27] create multiple routes from source to destination. The main advantage of this scheme is that the bandwidth between links is used more effectively, providing greater delivery reliability.

The FMRM algorithm [28] uses fuzzy logic to select a route by means of a fuzzy controller. Using fuzzy input parameters, FMRM [28] obtains the route expiration time, packet delivery rate and queue length for each neighbor that has a route to the destination, and thereby derives a priority for each node with respect to the packet to be transmitted. It then forwards the packet based on these priorities. The values of each input parameter are classified as low, medium, high and very high.

2.6 Hierarchical routing algorithms

Hierarchical routing protocols such as [29] build a hierarchy of nodes, typically through clustering techniques. Cluster heads (CHs) provide data aggregation and data fusion, improving the scalability and efficiency of routing. Vasilakos et al. [30] proposed a computational intelligence approach, a Reinforcement Learning Algorithm (RLA), for optimizing routing in ATM networks based on the Private Network-to-Network Interface (PNNI) standard. The solution addresses QoS issues related to routing, allocating network resources wisely so that the QoS requirements of each connection in the network are met. Through RLA, the protocol optimizes the various parameters of the hierarchical network structure, i.e., the computation, communication and storage requirements needed for routing. The algorithm aims at maximizing network revenue while ensuring the QoS requirements of each connection.

2.7 QoS-aware routing algorithms

QoS-aware routing protocols such as [31] consider QoS parameters such as residual energy, delay, static resource capacity [32], dynamic resource availability [32], neighborhood quality [32], and link quality and stability during route discovery. Many QoS-based routing algorithms have been proposed for MANETs [10–12, 25, 33–42]. Network lifetime and average end-to-end delay are important QoS parameters for a MANET. In order to avoid network partitioning, it is important to maximize the network lifetime before nodes fail because of battery exhaustion. A routing protocol that reduces the average end-to-end delay is desirable for all real-time applications [43]. In this sub-section we present some of these works, analyzing their main features and drawbacks. Since this work is specifically concerned with real-time routing, we cover extensions of these protocols in more detail.

Yen et al. [2] proposed a multi-constrained QoS multicast routing scheme using a genetic algorithm (GA) that considers the available resources and minimizes computation time in a dynamic environment. By selecting appropriate values for parameters such as crossover, mutation and population size, the GA improves and tries to optimize the routes. The protocol forms a multicast tree and calculates the total delay and residual energy of all nodes in the tree; a high fitness value minimizes the delay and maximizes the residual power in the tree. Simulation results demonstrate the superiority of the scheme over other routing methods. Vasilakos et al. [44] discussed the emerging area of Information-Centric Networking (ICN), the benefits of implementing such networks, and various open research issues. In [4], the authors proposed a reliable multicast protocol, called CodePipe, with advanced performance in terms of energy efficiency [45, 46], throughput and fairness in lossy wireless networks. Built upon opportunistic routing and random linear network coding, CodePipe not only simplifies transmission coordination between nodes but also significantly improves energy efficiency, fairness and multicast throughput by exploiting both intra-batch and inter-batch coding opportunities. CodePipe is able to build a reliable data delivery mechanism in a lossy wireless network; it was evaluated in the NS2 simulator against MORE and Pacifier. Ghaffari [47] proposed an energy-efficient routing protocol for wireless sensor networks (WSNs) using the A-star algorithm. In this scheme, the cost function considers parameters such as the residual energy, free buffer and link quality of the next neighbor node when selecting the optimal path, which leads to the optimization of energy consumption and packet delivery ratio. Meng et al. [9] proposed single-path and any-path routing schemes using spatial reusability, which provides high end-to-end throughput. Vasilakos et al. [13] presented a systematic exploration of DTNs. The chapters of this book, contributed by eminent authors, cover various aspects of and open research directions in this domain, including networking, wireless and mobile communications, technology analysis, an energy-aware routing protocol for DTNs, a routing-compatible credit-based incentive scheme for DTNs, mobile peer-to-peer systems over DTNs, and delay-tolerant monitoring of mobility-assisted networks. They also contributed chapters on DTNs in satellite, vehicular, mobile, space and wireless sensor network communication. A biological model of Physarum [33, 48] has been used to design a novel biology-inspired optimization algorithm for the minimal exposure problem (MEP). The authors first formulate MEP and the related models and then convert MEP into the Steiner problem by discretizing the monitoring field into a large-scale weighted grid. Physarum is able to find the shortest path to a food source through a maze-like structure. Their model maintains low running complexity while still achieving a high level of parallelism. They demonstrate that their solution is comparable with classical approximation methods and yet more efficient. Extensive simulations demonstrate that the proposed models and algorithm are effective for finding a road network with minimal exposure and are feasible for the Steiner problem [48]. Vasilakos et al. [49] proposed the EFRouter routing system, which sends data along a feasible path using fuzzy set theory and a GA. Statistical information is used to predict the traffic load on different links, and this traffic load is then used to calculate the cost of selecting a path. When a path is selected, the future probability of having to refuse further connections is taken into account in the cost calculation, using the fuzzy model; path selection is based on a quantization of the per-rate success probability. Shen et al. [50] discussed peer-to-peer media streaming systems that have been deployed successfully, along with the corresponding theoretical investigations, focusing on their potential for balancing load and improving system scalability. The Q-routing protocol applied the Q-learning algorithm to routing in wired networks so as to reduce packet transmission time; in this protocol, the number of hops was used as the reward for nodes [51].

The majority of algorithms proposed so far have not paid significant attention to the movement patterns of nodes, or they have made numerous assumptions about the nodes and the network topology. Using reinforcement learning is a relatively novel idea for routing in MANETs, and this issue constitutes a research gap that should be addressed by further research. Thus, as an attempt to fill this gap, in this paper we propose a routing algorithm that uses reinforcement learning and only the local data of nodes. The algorithm is aimed at providing an optimal route for transmitting data packets.

3 Reinforcement learning

Reinforcement learning (RL) refers to a type of learning achieved through interaction. The learner and decision maker is called the agent. The agent selects actions, and the environment reacts to the agent's actions. The state indicates the agent's current situation; based on the action performed, the agent may move to a new state, which may coincide with a former state. The environment also gives a reward for each action of the agent; the reward may be positive or negative. The reinforcement learning problem can be expressed as a Markov decision process, which consists of four components:

  1. The set of states {s1, s2, s3, …},

  2. The set of actions {a1, a2, a3, …},

  3. The state transition function \(P_{{ss^{\prime}}}^{a} = \Pr \left\{ {s_{t + 1} = s^{\prime} \mid s_{t} = s, a_{t} = a} \right\}\),

  4. The reward function \(R_{{ss^{\prime}}}^{a} = E\left\{ {r_{t + 1} \mid s_{t} = s, a_{t} = a, s_{t + 1} = s^{\prime}} \right\}\) and the policy function \(\pi \left( s \right) \to a\).
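For concreteness, these four components can be represented as a simple data structure. The following sketch is illustrative only and is not part of the original formulation; the Python type and field names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int   # e.g. s1, s2, s3, ...
Action = int  # e.g. a1, a2, a3, ...

@dataclass
class MDP:
    """Minimal container for the four MDP components listed above."""
    states: List[State]
    actions: List[Action]
    # transition[(s, a, s_next)] = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
    transition: Dict[Tuple[State, Action, State], float]
    # reward[(s, a, s_next)] = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
    reward: Dict[Tuple[State, Action, State], float]
    # policy: pi(s) -> a
    policy: Callable[[State], Action]
```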

In reinforcement learning, the purpose is to find an optimal policy, for which value functions are used. The state value function estimates how good it is for the agent to be in a given state. Accordingly, value functions are defined with respect to particular policies. Equation (1) defines this function:

$$V^{\pi } \left( s \right) = E_{\pi } \left\{ {R_{t} |s_{t} = s} \right\} = \mathop \sum \limits_{a} \pi \left( {s,a} \right)\mathop \sum \limits_{{s^{\prime } }} P_{{ss^{\prime } }}^{a} \left[ {R_{{ss^{\prime } }}^{a} + \gamma V^{\pi } \left( {s^{\prime } } \right)} \right]$$
(1)

In Eq. (1), γ denotes the extent to which the values of the next states affect the current state (0 < γ < 1). The action value function refers to the expected total reward obtained by taking action a and then following policy π from the next state until the terminal state:

$$Q^{\pi } (s,a) = \mathop \sum \limits_{{s^{\prime}}} P_{{ss^{{\prime }} }}^{a} \left[ {R_{{ss^{{\prime }} }}^{a} + \gamma V^{\pi } \left( {s^{{\prime }} } \right)} \right]$$
(2)

In Eq. (2), it is assumed that action a is taken first and policy π is followed thereafter. The optimal action value function, which is independent of the policy, is defined as follows:

$$Q^{*} \left( {s,a} \right) = \mathop \sum \limits_{{s^{\prime } }} P_{{ss^{\prime } }}^{a} \left( {R_{{ss^{\prime } }}^{a} + \gamma \max_{{a^{\prime } }} Q^{*} \left( {s^{\prime } ,a^{\prime } } \right)} \right)$$
(3)

In Eq. (3), we should select an action which has the highest value.
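In practice, the transition probabilities \(P_{{ss^{\prime}}}^{a}\) are unknown in a MANET, so \(Q^{*}\) must be estimated from samples. The sketch below shows a standard tabular Q-learning update together with the greedy selection implied by Eq. (3); it is a generic illustration under assumed parameter values, not the authors' implementation.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9                        # illustrative learning rate and discount factor
Q = defaultdict(lambda: defaultdict(float))    # Q[state][action] estimates Q*(s, a)

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max((Q[next_state][a] for a in next_actions), default=0.0)
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])

def greedy_action(state, actions):
    """Select the action with the highest estimated value, as in Eq. (3)."""
    return max(actions, key=lambda a: Q[state][a])
```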

4 The proposed method

Based on chaos-complexity theory, it can be argued that some order and discipline can be defined for the majority of environments, although such order is seldom found in environments and organizations where humans play the major roles. Hence, the method proposed in this paper estimates the behavioral pattern of the nodes indirectly; it uses Q-learning and selects neighboring nodes with low transmission delay in order to reduce the packet transmission time. A set of states was defined for the proposed method.

In this paper, the states of nodes were defined with regard to the target node.

For example, a node that should transmit a packet to a node with ID = 10 is in state 10. For defining actions, the nodes are divided into groups of five based on their IDs; for example, in Fig. 1 the nodes with IDs 1–5 form one group. In the majority of previously proposed methods, the actions are the nodes themselves; in the method proposed here, however, the actions are the groups. To forward a data packet, a node must select one of its available neighbors, and it does so through the groups. The purpose of grouping nodes is not only to reduce the number of actions but also to rank the value of actions. That is, while selecting an action, the agent first selects a transmitter group from among the groups at the first level, using the action values of the groups. The value of an action (group) is an estimate of the delay obtained from the nodes in the group; in addition to the delay, the group's success rate in delivering packets is also used to estimate its value. After selecting the group with the best packet transmission performance, the node offering the lowest hop count is selected as the next node to forward the packet. In this way, the probability of selecting a node with better performance is enhanced. The policy used for selecting an action is based on the following Boltzmann probability distribution:

$$P\left( {s,a} \right) = e^{{\frac{{Q\left[ {s,a} \right]}}{T}}} /\mathop \sum \nolimits_{a} e^{{\frac{{Q\left[ {s,a} \right]}}{T}}}$$
(4)
Fig. 1
figure 1

Defining state for the proposed method

In Eq. (4), Q(s, a) is the action value function and T is the temperature parameter. The value of an action (group) for transmitting a packet depends on the values of the nodes within that group; that is, the value of a group is derived from the values of its member nodes. After the group has been selected, the forwarding node is chosen based on the number of hops each node in the group offers for delivering the packet to the destination: the node offering the fewest hops is selected as the next hop. The pseudo-code of the route selection procedure is given below:

figure b
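Since the route selection pseudo-code is only available as a figure, the sketch below gives one possible reading of the procedure described above: a group is drawn according to the Boltzmann distribution of Eq. (4) over the group action values, and within the chosen group the neighbor advertising the fewest hops to the destination is selected as the next hop. The data structures and names (Neighbor, q_values, etc.) are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Neighbor:
    node_id: int
    group_id: int        # group derived from the node ID (five IDs per group)
    hops_to_dest: int    # hop count this neighbor offers to the destination

def select_next_hop(neighbors: List[Neighbor],
                    q_values: Dict[int, float],   # action value per group for the current state
                    temperature: float = 1.0) -> Neighbor:
    """Pick a group with Boltzmann probability (Eq. 4), then the member
    neighbor offering the fewest hops to the destination."""
    groups = sorted({n.group_id for n in neighbors})
    weights = [math.exp(q_values.get(g, 0.0) / temperature) for g in groups]
    chosen_group = random.choices(groups, weights=weights, k=1)[0]
    members = [n for n in neighbors if n.group_id == chosen_group]
    return min(members, key=lambda n: n.hops_to_dest)
```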

The values of actions are updated by means of the following Q-learning update equations:

$$T_{D} = T_{D} + \alpha \left( {R_{t} + \gamma \cdot {\text{Min}}_{\text{Value}} - T_{D} } \right)$$
(5)
$$P_{r} = P_{r} + \alpha \left( {R_{t} + \omega \cdot {\text{Max}}_{\text{Value}} - P_{r} } \right)$$
(6)

Equation (5) is used for updating the estimated transmission delay and Eq. (6) for updating the packet delivery success rate.
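A minimal sketch of these two updates is given below. The reward signal R_t and the Min_Value/Max_Value terms (the best delay and success-rate estimates available from the candidate next hops) follow the description above; the parameter values are placeholders, not those used in the paper.

```python
ALPHA, GAMMA, OMEGA = 0.1, 0.9, 0.9   # illustrative values only

def update_delay(t_d: float, reward: float, min_value: float) -> float:
    """Eq. (5): T_D <- T_D + alpha * (R_t + gamma * Min_Value - T_D)."""
    return t_d + ALPHA * (reward + GAMMA * min_value - t_d)

def update_delivery_rate(p_r: float, reward: float, max_value: float) -> float:
    """Eq. (6): P_r <- P_r + alpha * (R_t + omega * Max_Value - P_r)."""
    return p_r + ALPHA * (reward + OMEGA * max_value - p_r)
```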

It should be noted that, in addition to the transmission time, the success rate in delivering packets to the destination is taken into consideration when selecting the group to transmit packets. According to the selection policy, a group with a higher action value is more likely to be selected. Hence, the proposed algorithm cannot be expected to select the best action at every moment; this is a drawback of the algorithm. Figure 2 depicts the internal structure of the routing module, which is the most important part of the routing algorithm.

Fig. 2
figure 2

The internal structure of the routing module

As shown in Fig. 2, the routing module consists of several states, and at any moment a node is in exactly one of them. State 1 is used for the node's decision about the transition to the next state. For instance, when a packet is produced for transmission, the node in state 1 decides, based on the received signal, to enter state 2. In state 2, the node decides either to conduct the route discovery operation or, if a route to the target node is available, to perform the routing operation. In state 3, the node conducts the route discovery operation and then returns to state 1, where it waits for input signals so that it can make the next decision. In state 4, the node performs the routing operation when it is the target node and then returns to state 1. In state 5, the node receives an RREPLY and updates its routing table; if the RREPLY packet is intended for the node itself, it moves to state 4 for routing after updating the table. The node is in state 6 when it operates as an intermediate node forwarding the packet toward the target. In state 7, the node receives the acknowledgement packet and carries out the necessary updates of the action values. When the time allowed for the return of the RREPLY packet elapses without the packet being received, the node moves to state 8.
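To make the eight states of Fig. 2 easier to follow, the sketch below encodes them as an enumeration together with one example transition; the state names and the signal handling are illustrative assumptions, not the authors' exact implementation.

```python
from enum import Enum, auto

class RoutingState(Enum):
    DECIDE = auto()            # state 1: wait for a signal and decide the next state
    CHECK_ROUTE = auto()       # state 2: route known? discover a route or forward
    ROUTE_DISCOVERY = auto()   # state 3: perform route discovery, then back to state 1
    ROUTE_PACKET = auto()      # state 4: perform the routing operation
    HANDLE_RREPLY = auto()     # state 5: receive RREPLY, update the routing table
    RELAY = auto()             # state 6: act as an intermediate (interface) node
    HANDLE_ACK = auto()        # state 7: receive the acknowledgement, update action values
    RREPLY_TIMEOUT = auto()    # state 8: RREPLY not received within the allowed time

def decide_after_packet_generated(route_known: bool) -> RoutingState:
    """Example of the decision taken in state 2: route the packet if a route to
    the target is available, otherwise start route discovery (state 3)."""
    return RoutingState.ROUTE_PACKET if route_known else RoutingState.ROUTE_DISCOVERY
```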

While transmitting route discovery packets and data packets, the source node produces a list called the node path list and puts its own address at the beginning of the list. The node path list keeps the addresses of the nodes through which the packet has passed. When an intermediate node receives such a packet, if the packet is not addressed to the node itself, it appends its address to the end of the list and forwards the packet. If the packet is addressed to the node, the node inverts the node path list and transmits the reply to the first node in the inverted list. Likewise, when a node receives an RREPLY or acknowledgement packet, it removes its own address from the beginning of the list and forwards the packet to the next node in the inverted node path list, until the packet is delivered to its destination. The pseudo-code related to Fig. 2 is given below:

figure c
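As a rough companion to the pseudo-code referenced above, the sketch below illustrates the node path list handling described earlier: the source places its address first, intermediate nodes append theirs, and the receiving node inverts the list so that RREPLY and acknowledgement packets can retrace the path. All function names are hypothetical.

```python
from typing import List, Optional

def start_path(source_addr: int) -> List[int]:
    """The source node creates the node path list with its own address first."""
    return [source_addr]

def append_and_forward(path: List[int], my_addr: int) -> List[int]:
    """An intermediate node appends its address to the end of the list
    before forwarding the packet."""
    return path + [my_addr]

def build_return_path(path: List[int]) -> List[int]:
    """The receiving node inverts the list and sends the reply to the first
    node of the inverted list."""
    return list(reversed(path))

def pop_and_forward(return_path: List[int], my_addr: int) -> Optional[int]:
    """A node on the return path removes its own address from the front of the
    list and forwards the packet to the node now at the front."""
    if return_path and return_path[0] == my_addr:
        return_path.pop(0)
    return return_path[0] if return_path else None   # None: the path has been fully consumed
```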

5 Performance evaluation of the proposed method

To evaluate the efficiency of the proposed method in transmitting packets, it was compared with the E-Ant-DSR [19], dynamic source routing (DSR) [52] and ant colony-based routing algorithm (ARA) [53] algorithms. The OPNET 14.5 simulation environment was used. Tables 1 and 2 show the simulation parameters and their values for the first and second scenarios, respectively.

Table 1 Simulation parameters of the first scenario
Table 2 Simulation parameters of the second scenario

In the first scenario, as shown in Fig. 3, the proposed algorithm is relatively more stable than the DSR [52] algorithm; at the beginning, however, it has a higher delay than DSR [52]. This can be attributed to the action selection policy: the policy is probabilistic, and at the start of the simulation all actions are equally likely to be selected. As the simulation proceeds, the estimates of the neighbor values improve and the packet transmission delay decreases.

Fig. 3
figure 3

Average end-to-end delay (first scenario)

Also, as shown in Fig. 4, in an environment with a relatively stable topology the algorithm proposed in this paper performs better than the ant colony routing algorithm with respect to adaptability. Figure 4 indicates that as the number of nodes increases, and even when the environment is more stable, the proposed algorithm performs better than DSR [52] and ARA [53]. It should be noted that as the number of nodes increases, the number of possible actions for each node increases; however, even under such conditions, the proposed algorithm outperforms DSR [52] and ARA [53].

Fig. 4
figure 4

Packet delivery rate

In the second scenario, in addition to increasing the pause time parameter to 10 s, the speed of the nodes was increased to 0–10 m/s so as to increase node movement. The simulation results for this scenario are given in Figs. 5 and 6.

Figure 5 reveals that the proposed algorithm works better than the DSR algorithm in a less stable environment. Given the movement of nodes and the many changes in their locations, it can be argued that the proposed algorithm is significantly more efficient than the DSR [52] algorithm. The proposed algorithm was also compared with ARA [53] and DSR [52] in an unstable environment in terms of packet delivery rate; in this setting, the pause time of nodes was reduced to a range of 0–5 s and the movement speed was increased to a range of 0–10 m/s.

Fig. 5
figure 5

Average end-to-end delay

As shown in Fig. 6, the proposed algorithm is more efficient than ARA [53] when routes are unstable. This can be attributed to the criterion used by the proposed algorithm for selecting the next hop and to the slowness of ARA [53] in adapting to the new topology. The figure also shows that the proposed algorithm performs better than the ant colony routing algorithm in a relatively stable environment. Since the route selection policy of the ant colony algorithm creates congestion [54] at intermediate nodes, the proposed algorithm achieves better performance.

Fig. 6
figure 6

Packet delivery ratio

In the second scenario, the proposed algorithm was also compared with an algorithm that uses a neural network for routing (ANN) and with one based on a GA. Moreover, it was compared with the algorithm put forth in [19], which combines DSR [52] with the ACO algorithm.

As illustrated in Fig. 7, in a stable environment the proposed algorithm is more efficient than ANN. Since the proposed algorithm uses clustering [45, 55] to specify the value of actions, it can provide a better estimate of these values. Furthermore, the proposed algorithm relies on the law of large numbers to predict the value of actions; consequently, the values it obtains are more realistic and valid. The ANN algorithm uses neural networks to determine the value of actions, where each estimation is based on previous estimations. The GA considers only the topology of the environment when selecting a route; since the topology is dynamic, it is prone to errors in selecting the optimal route. The E-Ant-DSR [19] algorithm selects routes based on parameters that vary over time; consequently, E-Ant-DSR [19] cannot adapt as well to topological changes, which leads to longer packet delivery delays.

Fig. 7
figure 7

Average end-to-end delay

As shown in Fig. 8, the proposed algorithm achieves a better packet delivery rate than the other algorithms. It should be noted that the proposed algorithm uses probabilities to select actions. Since E-Ant-DSR [19] is based on ant colony optimization, it adapts more slowly than the proposed algorithm to the constant changes of the environment; as a result, the packet delivery rate of E-Ant-DSR [19] decreases. Moreover, since FMRM [28] uses time-varying parameters for selecting the next nodes, its efficiency also decreases.

Fig. 8
figure 8

Packet delivery rate

6 Conclusions and future works

In this paper, a routing algorithm based on reinforcement learning and local information was proposed. The results illustrated in the figures indicate that reinforcement learning is a promising concept for MANETs. Furthermore, as mentioned earlier, the proposed algorithm makes no assumptions about the environment and relies only on the local information that nodes obtain from their neighbors; this approach can be followed by future studies. As the simulation results showed, the delay decreases over time relative to the initial phase; this feature can be examined further in future work. The packet delivery rate obtained for the proposed algorithm was also promising. It should be pointed out that both the packet delivery rate and the delay can be further improved by selecting better policies. These research directions should be addressed by interested researchers in the future. The following recommendations can be considered in future studies:

  • The action selection policy in the proposed algorithm was based on the Boltzmann equation, and improving the policy over time was another critical factor in enhancing the efficiency of the algorithm. It is recommended that other possible policies be examined so that an optimal routing policy can be selected; examining complementary algorithms is a significant item on this research agenda.

  • The grouping (clustering) used in the proposed algorithm was simple. Other grouping patterns can be used in future studies; in other words, using better and more suitable criteria for clustering the nodes may enhance the efficiency of the algorithm.

  • Last but not least, better states and conditions should be defined for nodes.