1 Introduction

With the development of communication technology, the VANET (vehicular ad-hoc network) has become the most important part of building an intelligent city and an intelligent transportation system (ITS). Because it is an extension of traditional MANETs (mobile ad-hoc networks), VANET has elicited much interest from automobile manufacturers, governments, and research institutions to provide services such as traffic warning, information consulting, and entertainment; many projects in these fields have been launched [1, 2]. According to WHO reports, the number of deaths caused by traffic accidents is 1.24 million worldwide every year. World Bank statistics show that the world economic loss caused by traffic accidents is $500 billion per year [3]. With greater numbers of vehicles in cities, solving traffic congestion has become a worldwide problem. Traffic congestion can cause time delay, wasted fuel consumption, and unnecessary environmental pollution. At the same time, with the development of a social vehicle network, in-car entertainment and communication between vehicles have become necessary functions that automobile manufacturers provide to customers [4]. In VANETs, wireless communication can be divided into two types: vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V). V2V is favored by many researchers because communication does not rely on infrastructure and its deployment is flexible. However, VANET has some characteristics that are different from traditional MANETs [5]. First, in VANETs, the network nodes (cars) move at high speed, which will cause frequent changes in network topology. The short link time maintained between nodes makes the links unreliable, and network performance varies greatly with vehicle distribution. Second, vehicle nodes are restricted to roads and are affected by many factors like speed limits and traffic signals. Third, vehicles can provide enough power and computing ability for themselves, meaning that the energy limitations of MANET are no longer an important factor.

As an important support technology to realize an intelligent transportation system, the design of an efficient routing protocol has become an important step in implementing VANET applications. Because the VANET topology changes frequently and the links are unreliable, traditional routing algorithms based on MANETs, such as the Ad-hoc On-demand Distance Vector (AODV) [6] and Dynamic Source Routing (DSR) [7], are difficult to use with VANETs. These algorithms have a lower delivery ratio and a longer transmission delay because they do not consider rapid topology changes. To overcome this problem, some routing algorithms based on geographic location, such as TFOR [8] and GyTAR [9], have been proposed recently [10]. These routing algorithms are designed based on traffic flow estimation, use GPS to locate the destination node, and ignore topology changes, but they do not consider link reliability between nodes. Some reliable routing algorithms have been proposed, such as SLBF [11] and EG-RAODV [12]. These algorithms were designed by adding parameters such as link reliability between nodes and packet error rate. This approach can improve the packet delivery ratio to a certain extent, but cannot guarantee the time delay. In VANETs, a rapid change in velocity is the main cause of topology changes and unreliable links. Only a full understanding of vehicle motion characteristics can lead to effective evaluation of link reliability and efficient routing algorithm designs.

Assuming that vehicles are moving on a fixed route, various factors can break the link between two nodes, such as the traveling direction of the two vehicles, the distance between vehicles, speed, and acceleration. To evaluate link reliability between nodes effectively, these factors should be fully considered, especially given the low-density, high-speed environment. The distance between vehicles plays an important role in ensuring traffic safety and evaluating link reliability. Many models of distance between vehicles have been proposed, such as the exponential distribution model, the normal distribution model, and the lognormal distribution. In [13], the authors proved by experiments that the distance between vehicles can be more consistent with real traffic flow and safe distances when it follows a lognormal distribution. A full understanding of the traffic flow model and the motion characteristics of vehicles can help evaluate link reliability between nodes effectively. With ongoing studies of routing algorithms in recent years, some intelligent algorithms [14] have been applied to VANETs and have yielded better results than traditional routing algorithms. As a self-learning algorithm, the Q-Learning algorithm [15] can find the shortest path from a source node to a destination node through constant interaction with its environment. Based on this idea, in this paper, a reliable adaptive routing algorithm (RSAR) is proposed by improving the Q-Learning algorithm. It can adaptively adjust the Q-table while ensuring the reliability of each hop link to adapt to a dynamic network topology such as VANET.

The main purpose of this paper is to propose a reliable and adaptive routing algorithm. The reliability of an entire link depends on the links between each hop. This paper establishes a reliable model of a link between nodes by a detailed study of the motion properties of vehicles. Once the probability of link reliability has been calculated, it can be used as a parameter in the Q-Learning algorithm to design the RSAR algorithm.

This paper is organized as follows. In the second section, the routing algorithm is simply classified and partly analyzed. In the third section, the development of a reliable link evaluation model is described. The fourth section presents the main concept of the RSAR algorithm and the development of the model. The fifth section describes the simulation and analysis performed. The sixth and final section provides a summary and suggestions for future work.

2 Related work

2.1 Routing protocol classification

In VANETs, the routing algorithms can be divided into two main types: one based on topology, the other based on geographical location. Routing algorithms based on topology can be divided into reactive (on-demand) routing algorithms and active (table-driven) routing algorithms. In a reactive routing protocol, nodes establish the route according to their requirements and maintain an end-to-end path. Representative routing algorithms include AODV and DSR. Some algorithms like EG-RAODV [12] and QLAODV [15] have been derived on this basis and are more suitable to VANETs. In an active routing algorithm like DSDV and OLSR [16], nodes maintain the topology of the whole network through exchanging the routing table. In a rapidly changing VANET, a fast change in topology is the only factor that affects the active routing algorithm. In a protocol based on geographical location, a GPS is provided for each vehicle so that its coordinates, speed, and direction can be accurately determined. The vehicle does not need to maintain the path from the source to the destination node, and each node maintains only one neighbor table to record the location and speed of neighbor nodes. Routing algorithms based on geographical location can be divided into two kinds. Algorithms in which the sending node makes the routing decision actively are called sending node-based forwarding (SBF). Algorithms in which the receiving node determines whether it can be the next segment of a link are called receiving node-based forwarding (RBF). Among SBF-based routing mechanisms, the most representative is the GPSR [17] routing. In [18], the STAR algorithm is proposed. These are both SBF-based routing algorithms. In RBF-based routing, sending nodes send data packets by broadcasting them. When receiving nodes receive broadcast data packets, they compete for the right to forward them. The nodes that obtain the forwarding right become the next hop in a link and suppress other nodes from forwarding packets. SLBF [11] is a typical RBF algorithm that uses node locations to calculate waiting time; the nodes that have the shortest waiting time have the right to forward packets. Vegni [19] proposed a clustering-based RBF algorithm that reduces the number of forwarding nodes to suppress redundant packets.

2.2 Link evaluation algorithm

According to routing algorithm classification research, many routing algorithms for VANETs have been proposed, but algorithms that predict link reliability according to motion characteristics of vehicles are rare. In VANETs, it is necessary to ensure link reliability, which guarantees the entire path’s validity. In [20], the authors predicted link reliability and routing path according to signal strength. In [21], the authors developed a mobile model to estimate the duration of an n-node path, in which they considered speed, but not acceleration. The MORP [22] algorithm based on motion prediction has also been proposed. MORP predicts the future position of the vehicle and the stability of the link. By using location, direction, and speed to predict link state, MORP selects the most stable end-to-end routing path. A prediction-based routing algorithm [23] has been proposed for VANETs and adjusts the routing path by predicting the life cycle of the link, which is predicted by communication range, vehicle position, and corresponding speed. Considering that the routing path consists of many links, the minimum-duration link determines the maximum survival time of the whole link, so that the maximum link duration of the path can be effectively predicted.

2.3 Q-learning algorithm

The Q-Learning algorithm [24] is a reinforcement learning algorithm, which is proposed in a machine learning context. The learning process is such that in different states, the agent finds a path to reach the destination node with the maximum return value by periodically updating its state-activity value (Q-Value). Q-Learning is an unsupervised active learning algorithm, which does not require a specific system model and can adapt to different environments by interacting with its surroundings. In the Q-Learning algorithm, the Q-value Q(s, a)(s ∈ S, a ∈ A) represents a reward value that a learning agent converted from state s to another state through an activity a, where s represents a state and a represents an activity. Because a state closer to the destination will receive a higher reward value, the distance between the state and the destination can be effectively determined according to the reward value. As the agent continuously converts activities to find the path to the destination node in different states, it can ultimately find the shortest path from the source node to the destination node. This intelligent algorithm is beginning to be used in VANETs. Some researchers [25,26,27,28] have proposed an algorithm by combining fuzzy logic with Q-Learning (FQ-OPP); this algorithm can be adapted to dynamic topology changes. References [29,30,31,32] proposed a QGrid algorithm based on Q-Learning; they divided the entire region into grid cells and selected the next hop forwarding node by calculating the Q-value from both macro and micro aspects.

In addition, recent related studies are described in [33,34,35,36,37,38,39]. For example, reference [33] investigated a centralized model and a heuristic algorithm for solving the multi-depot logistics delivery problem, including depot selection and shared commodity delivery. Reference [37] concentrated on introducing an adaptive cooperative caching strategy with novel cache replacement and prefetching policies, which divided the network into non-overlapping clusters. Reference [39] proposed a novel filter model based on a hidden Markov model (HMM) (FM-HMM) for an intrusion detection system to reduce the overhead and time required for detection without impairing the detection rate. Moreover, relative comparisons of results among these methods have been done.

3 Link reliability model

On a highway, the fast movement of vehicles makes links very unreliable. Designing a reliable routing algorithm is a challenging task. The movement of vehicles is the main factor that makes the links unreliable. To evaluate link reliability between nodes effectively, a capable system model and link duration model must be developed. The first step is to develop the system model based on vehicle motion characteristics; then the duration of links between nodes can be evaluated, leading to construction of a reliable link model.

3.1 System model

To evaluate link quality between nodes effectively, assuming that the highway is generally a straight road as in [40, 41] and that the broadcasting distance is much longer than the width of the road, it can be further assumed that the width of the road has a very small influence on selecting the next hop forwarding node, and therefore the width of the road can be ignored. Figure 1 shows the highway model used. It can further be assumed that acceleration, deceleration, changing lanes, overtaking, and other maneuvers are taking place on the road. Besides, the distance between vehicles is log-normally distributed [16], that is, Xi ∈ log N(μi, δi). As shown in Fig. 1, Xi = {Xi(m), m = 0, 1, 2, ⋯} is a random variable that is log-normally distributed. Xi represents the distance between vehicles i and i + 1, and Xi(m) is a random variable that represents the distance between node i and other nodes at time m. In Fig. 1, if node Vs is regarded as the reference node, then X represents the distance between Vs and any other node, where \( X=\sum \limits_{i=1}^m{X}_i \), so X is also log-normally distributed [42].

Fig. 1
figure 1

Road model

Lemma 1

Assuming that X ∈ log N(μ, δ), the random variable \( T=\sqrt{aX+b}+c \) is log-normally distributed, where a, b, c ∈ R, a, b, c ≠ 0 and aX + b ≥ 0.

Proof

Let GT be the probability distribution function of T. For every positive t, GT(t) = Pr[{T ≤ t}]. Obviously, because T is continuous,

$$ {\displaystyle \begin{array}{l}{G}_T(t)=\Pr \left[\left\{T\le t\right\}\right]\\ {}\kern2.3em =\Pr \left[\left\{\sqrt{aX+b}+c\le t\right\}\right]\\ {}\kern2.3em =\Pr \left[\left\{ aX\le {\left(t-c\right)}^2-b\right\}\right]\\ {}\kern2.3em =\Big\{\begin{array}{cc}{F}_X\left(\frac{{\left(t-c\right)}^2-b}{a}\right)& \mathrm{when}\kern0.5em a>0\\ {}1-{F}_X\left(\frac{{\left(t-c\right)}^2-b}{a}\right)& \mathrm{when}\kern0.5em a<0\end{array}\end{array}} $$

where FX is the probability distribution function of X. When a > 0, it is clear that T is log-normally distributed. When a < 0, according to [13], \( F(z)=\frac{1}{2}+\frac{1}{2}\mathit{\operatorname{erf}}\left(\frac{z-{\mu}_z}{\sigma_z\sqrt{2}}\right) \), and letting z = ((t − c)2 − b/a),

$$ {\displaystyle \begin{array}{l}1-{F}_X(z)=\frac{1}{2}-\frac{1}{2}\mathit{\operatorname{erf}}\left(\frac{\ln z-\mu (X)}{\sigma (X)\sqrt{2}}\right)\\ {}\kern3.57em ={F}_Y\left(\frac{a}{{\left(t-c\right)}^2-b}\right)\end{array}} $$

where Y is a log-normal random variable with parameters −μ(X) and σ(X). Note that the fact that − erf (x) =  erf (−x) is used. Hence, T obeys a log-normal distribution, completing the proof.

Lemma 2

Assuming that X ∈ log N(μ, σ), the random variable T = aX + b is log-normally distributed, where a, b, c ∈ R and a, b, c ≠ 0.

This lemma can be easily proved using the arguments adduced in the proof of Lemma 1.

3.2 Link duration model

In Fig. 2, “a” shows two vehicles moving in the same direction, and “b” shows two vehicles moving in opposite directions, where the black arrows indicate the direction of motion.

Fig. 2
figure 2

Two cases of link breakage

Assuming that the vehicles are moving on a fixed road, there are two main conditions (shown in Fig. 2) under which the link between two nodes becomes disconnected. In Figure “a”, two vehicles are moving in the same direction, and in Figure “b”, two vehicles are moving in opposite directions. Consider vehicle i as the reference node and vehicle j as the transmission node. In other cases, vehicles i and j have the longest link communication time. The next step is to analyze the link duration in detail in these two cases.

As shown in Fig. 2a, assume that at time t0 = 0, vehicle i is within the one-hop communication range of vehicle j and that vehicle i is located in front of vehicle j. The initial distance between vehicles X is a random variable. The maximum communication radius of vehicle R is a fixed constant. At the initial moment, X satisfies

$$ 0\le X<R. $$

According to the system model, there will be acceleration, deceleration and overtaking of vehicles on a highway. The maximum speed limit on the road is set to vm, i.e., all vehicles should move at a speed less than or equal to vm. Assume that the vehicle’s acceleration is a(0) and that its speed is v(0). When t ≥ 0, the acceleration is defined as a(t), and the speed is defined as v(t). At time t, the acceleration is as follows:

  1. 1)

    If a(0) = 0, for all t ≥ 0,

$$ a(t)=0. $$
(1)
  1. 2)

    If a(0) > 0,

$$ a(t)=\Big\{{\displaystyle \begin{array}{cc}a(0)& t\le \frac{v_m-v(0)}{a(0)}\\ {}0& \mathrm{otherwise}\end{array}}. $$
(2)
  1. 3)

    If a(0) < 0,

$$ a(t)=\Big\{{\displaystyle \begin{array}{cc}a(0)& t\le \frac{-v(0)}{a(0)}\\ {}0& \mathrm{otherwise}\end{array}}. $$
(3)

From the above analysis, it can be assumed that when t0 = 0, if the acceleration a(0) at this time is also 0, then the instantaneous acceleration is also 0, i.e., a(t) = 0. For Eqs. (2) and (3), if the speed is below the maximum speed limit or is reduced to 0, it is assumed that the acceleration is still a(0); otherwise, it is changed to 0. Taking the initial speed v(0) into account,at time t, the instantaneous speed v(t) is defined by the following formula:

$$ v(t)=v(0)+\underset{0}{\overset{t}{\int }}a(u) du, $$
(4)

where u ∈ [0, t], a(u) is the acceleration at time u.

Now Eqs. (14) can be combined to calculate the instantaneous speed:

  1. 1)

    If a(0) = 0, for all t ≥ 0,

$$ v(t)=v(0) $$
  1. 2)

    If a(0) > 0,

$$ v(t)=\Big\{{\displaystyle \begin{array}{cc}v(0)+a(0)t& t\le \frac{v_m-v(0)}{a(0)}\\ {}{v}_m& \mathrm{else}\end{array}} $$
(5)
  1. 3)

    If a(0) < 0,

$$ v(t)=\Big\{{\displaystyle \begin{array}{cc}v(0)+a(0)t& t\le \frac{-v(0)}{a(0)}\\ {}{v}_m& \mathrm{else}\end{array}} $$
(6)

According to the above definition, the distance that any vehicle moves at speed v(x) in time interval [0, t] is defined as:

$$ S(t)=\underset{0}{\overset{t}{\int }}v(x) dx. $$
(7)

Again according to the above definition, as Fig. 2 shows, the distance between vehicles i and j at time t can be calculated. Assuming that the initial speed and acceleration of vehicles i and j are ai(0), vi(0), aj(0), and vj(0) respectively, the instantaneous speed and acceleration of vehicles i and j at time t are ai(t), vi(t), aj(t), and vj(t). According to Eqs. (1–7), the distance between vehicles i and j in time interval [0, t] can be calculated as follows:

$$ {\displaystyle \begin{array}{l}{S}_i(t)=\underset{0}{\overset{t}{\int }}{v}_i(x) dx,\\ {}{S}_j(t)=\underset{0}{\overset{t}{\int }}{v}_j(x) dx.\end{array}} $$

When the initial distance between vehicles i and j is X, the distance di, j between i and j at time t is:

$$ {d}_{i,j}=\Big\{{\displaystyle \begin{array}{cc}{S}_j(t)+{S}_i(t)+X& \mathrm{same}\ \mathrm{direction}\\ {}{S}_j(t)-{S}_i(t)+X& \mathrm{opposite}\ \mathrm{direction}\end{array}}. $$
(8)

From Eq. (8), it is obvious that when di, j > R, the link is disconnected.

The next step is to analyze the link duration in Fig. 2b, where two vehicles are moving in opposite directions. When two vehicles meet:

$$ {S}_j(t)+{S}_i(t)+X=R. $$
(9)

The maximum link duration t can now be calculated. By assuming that \( {S}_j(t)+{S}_i(t)=\frac{1}{2}{a}_r{t}^2+{v}_rt \), where ar = ai + aj and vr = vi + vj, and by substituting this into Eq. (9), the maximum link duration t can be obtained as:

$$ t=\frac{-{v}_r+\sqrt{v_r^2+2{a}_r\left(R-X\right)}}{a_r}. $$
(10)

When two vehicles are moving in the same direction, as shown in Fig. 2a, during acceleration, deceleration, and overtaking, it is critical to determine whether vehicle i or j is in front. When Sj(t) − Si(t) + X > 0, vehicle j is located in front of vehicle i; otherwise, vehicle i is located in front of vehicle j. To express effectively which vehicle is in front, a symbolic function was defined as follows:

$$ I\left(i,j\right)=\Big\{{\displaystyle \begin{array}{cc}1& {S}_j(t)-{S}_i(t)+X>0\\ {}-1& \mathrm{otherwise}\end{array}}. $$
(11)

When the link is in a critical state of disconnection,:

$$ {S}_j(t)-{S}_i(t)+X=R\cdot I\left(i,j\right). $$
(12)

At this point, two cases must be considered to calculate the link duration.

When I(i, j) = 1, vehicle j is located in front of vehicle i. From Eq. (12), Sj(t) − Si(t) + X = R. Similarly to Eq. (10), because \( {S}_j(t)-{S}_i(t)=\frac{1}{2}{a}_r{t}^2+{v}_rt \), where ar = aj − ai and vr = vj − vi, the time t can be determined as:

$$ t=\frac{-{v}_r+\sqrt{v_r^2+2{a}_r\left(R-X\right)}}{a_r}. $$
(13)

Similarly, when I(i, j) =  − 1, vehicle i is located in front of vehicle j. From Eq. (15), Sj(t) − Si(t) + X =  − R, and the link duration t can be determined as:

$$ t=\frac{-{v}_r-\sqrt{v_r^2-2{a}_r\left(R+X\right)}}{a_r} $$
(14)

Equations (10), (13), and (14) can be used to calculate the link duration of the sending node and an arbitrary node within one hop range.

Lemma 3

Assuming that the communication link between two vehicles i and j breaks at time t, the link duration time is either a linear function of X or a square root function of X.

Proof

When the link disconnects at time t, by the definition of Si(t) according to Eq. (7), it is known that \( {S}_i(t)=\underset{0}{\overset{t}{\int }}{v}_i(x) dx \) is a linear function of t when the speed vi is constant, i.e., vi = vm. Let Si(t) = at + b. Similarly, if vj is a constant, it can be concluded that Sj(t) = ct + d is also a linear function of t. Obviously, according to Eq. (8), when the two vehicles are moving in opposite directions, (a + c)t + b + d + X = R. The link duration time \( t=\frac{R-b-d-X}{a+c} \) is a linear function. Similarly, when a link has been established between codirectional vehicles i and j, the link duration time is also a linear function according to the equation (c − a)t − b + d + X = R ⋅ I(i, j). Hence, the link duration t is a linear function when both vi and vj are constant. If either vi(t) or vj(t) are not constant, then the distance function will be a quadratic polynomial. Without loss of generality, let vi(t) = vi(0) + ait. By definition, the distance function is:

$$ {\displaystyle \begin{array}{l}{S}_i(t)=\underset{0}{\overset{t}{\int }}{v}_i(0)+{a}_i\\ {}\kern1.88em ={v}_i(0)t+\frac{1}{2}{a}_i{t}^2\end{array}}. $$

As shown in Eqs. (9) and (12), \( {S}_j(t)\pm {S}_i(t)=\frac{1}{2}{a}_r{t}^2+{v}_rt \) and a ≠ 0. Therefore, the solution must be a square root function of X.

Theorem 1

The duration T of the link between vehicles i and j is log-normally distributed.

Proof

By Lemma 3, the link duration can be expressed as either aX + b or \( \sqrt{aX+b}+c \). By Lemma 1, the expression \( \sqrt{aX+b}+c \) is log-normally distributed. By Lemma 2, aX + b is log-normally distributed. Hence, in all cases, the duration time of the link has a log-normal distribution. This completes the proof of the theorem.

3.3 Link reliability model

Based on the link duration between nodes, the link duration of vehicles i and j is log-normally distributed as T ∈ log N(μt, σt), where the expectation is μt and the variance is σt. Link reliability can then be evaluated according to duration. Link reliability is defined as the probability of direct communication between two vehicles in time period Tp. Assuming that the communication link between any two vehicle nodes is l, when t = t0, the link reliability between nodes is:

$$ r(l)=P\left\{{t}_0+{T}_p|{t}_0\in M\right\}, $$

where M is the starting time for the connection between the two vehicles. Hence, the link reliability is:

$$ {r}_t(l)=\Big\{{\displaystyle \begin{array}{cc}\underset{t_0}{\overset{t_0+{T}_p}{\int }}f(T) dT& {T}_p>0\\ {}0& \mathrm{else}\end{array}} $$
(15)

where f(T) is the probability function of the time interval T.

4 RSAR algorithm

In VANETs, the fast motion of the nodes makes the network topology change frequently. It is difficult for a simple forwarding strategy to find an efficient routing path in this rapidly changing network environment. Related studies have presented the Q-Learning algorithm as an unsupervised self-learning algorithm. Through continuous interaction with the external environment, it can adaptively adjust its value to find the optimal path to the destination. Hence, it is well suited to dynamic VANETs. This section will describe the routing and forwarding strategy of the Q-Learning algorithm in VANETs.

4.1 Idea of Q-learning algorithm in VANET

Standard Q-Learning is a heuristic learning method based on a learning agent. Generally, in the standard Q-Learning algorithm, the learning process of the agent is composed of a 3-tuple {S, A, R}, where S = {s1, s2, ⋯, sn} represents the state space; A = {a1, a2, ⋯, an} represents the activity space, and moving from one state to another is regarded as an effective activity; R represents the immediate reward for an activity; and the closer the agent is to the destination, the higher the reward obtained for the activity. The following subsections will present a few related definitions first and then describe the learning process in detail.

4.1.1 Related definitions

Definition 1

Basic components.

Learning environment: The whole VANET environment as the learning environment of the agent;

Agent: Each vehicle node is a learning agent;

State space S: The state space of a given agent is composed of all other nodes except for this agent;

Activity space A: A beacon packet is transmitted from one vehicle to another, which is defined as an activity;

Immediate reward R: obtained when an agent carries out an activity.

Definition 2

Reward value: According to Definition 1, the reward that an agent receives for carrying out an activity is called the immediate reward, and its range is [0, 1]. Because the destination node can directly reach the destination node, its reward value is 1. Formula (16) defines the initial value R of the entire network as follows:

$$ R=\Big\{{\displaystyle \begin{array}{cc}1& s\kern.3em \in \kern.3em {N}_d\\ {}0& \mathrm{else}\end{array}}. $$
(16)

where Nd represents the one-hop neighbor node set of destination node D. The reward value of an activity for all the neighbor nodes of the destination node is 1. In the learning process, the reward value that may be obtained from a transition from one state to another is indicated by the Q-value Q(s, a)(s ∈ S, a ∈ A), which lies in the range [0, 1].

Definition 3

Q-table: Each learning agent maintains a two-dimensional table, which is used to record the Q-values of the reachable destination node and its one-hop neighbor nodes. This two-dimensional table is called the Q-table (shown in Table 1).

Table 1 Q-table

In the Q-table, the first line contains the IDs of all possible destination nodes, which are expressed as Di. The first column contains the IDs of the one-hop neighbor nodes, which are expressed as Ni. Q(D1, N1) represents the Q-value between the node and the neighbor node N1 when it reaches the destination node D1. As shown in Table 1, the Q-table is a two-dimensional table, its size is determined by the number of neighbor nodes and the number of destination nodes. It is obvious that it has good scalability. The values in the Q-table are updated by periodically exchanging beacon packets between nodes. The learning task is distributed to each node, which makes the algorithm quickly converge to the optimal path, and adjustments to changes in network topology can be made in a timely manner.

4.1.2 Learning process

The above definitions establish that every vehicle node is defined as a learning agent. Unlike traditional ways of learning, each vehicle node has a Q-table. Nodes complete the task of learning by exchanging beacon information and updating their Q-tables. The beacon packet sent by each node contains not only its own speed, location, and other information, but also the maximum Q-value of the neighbor nodes to the destination node, i.e., the maximum value in a column (as shown in Table 2). Assume that there is a VANET topology G = {V, E} as shown in Fig. 2, where V = {A, B, C, ⋯, H} represents the set of vehicle nodes, and that for node A, its state space SA is the set of all nodes that do not contain A. Edge set E represents a collection of nodes that can communicate directly in one hop range. Assume that as in Fig. 3, A is the source node, and G is the destination node. Now the challenge is to find an optimal path from the sending node A to the destination node G using the self-learning approach.

Table 2 Network parameter settings
Fig. 3
figure 3

Model of the learning process

Learning tasks are assigned to each vehicle node agent, and the learning process is mainly updating the Q-table in the agent and meanwhile updating the state activity pair of the Q-value, Q(s, a)(s ∈ S, a ∈ A). The standard Q-Learning function is given as Eq. (17):

$$ {Q}_s\left(d,x\right)\leftarrow {Q}_s\left(d,x\right)+\left\{R+\gamma \cdot \underset{y\in {N}_x}{\max }{Q}_x\left(d,y\right)\right\}, $$
(17)

where Qs(d, x) is the Q-value to be updated, S is the node, x is its neighbor node, Nx is x‘s neighbor node, D is the destination node, R is the reward value, and \( \underset{y\in {N}_x}{\max }{Q}_x\left(d,y\right) \) is the maximum Q-value between x and its neighbor node to the destination node D. The discount factor γ is an important parameter because it affects the reward that a node obtains for carrying out an activity according to Eq. (17). Considering that link stability is an important parameter, the link reliability r(l) between nodes, which was calculated according to Eq. (15), was used as a discount factor, that is, γ = r(l). In VANETs, the available bandwidth, as an important parameter, determines the rate of packet transmission. As defined in [28], the bandwidth BW can be calculated as:

$$ BW(bps)=\frac{n\times {S}_B\times 8}{T}, $$
(18)

where n represents the number of packets that the node sends and receives, SB is the size of the packet in bytes, and T is the time interval. Assuming that the maximum available bandwidth of the node is a fixed value defined as maxBW, the bandwidth factor can be calculated as:

$$ BF=\frac{\max_{BF}- BW}{\max_{BF}}. $$
(19)

As the factor that affects learning speed, the bandwidth factor varies with changes in effective bandwidth and determines the learning progress of each vehicle node. Equation (17) can be modified to obtain a new heuristic function:

$$ {Q}_s\left(d,x\right)\leftarrow \left(1- BF\right)\cdot {Q}_s\left(d,x\right)+ BF\cdot \left\{R+\gamma \cdot \underset{y\in {N}_x}{\max }{Q}_x\left(d,y\right)\right\}. $$
(20)

According to Eq. (20), the greater the number of hops, the smaller the reward value. Hence, the final reward value is based on the number of hops, the link reliability, and the bandwidth. By adding bandwidth and link state, the optimal path from source node to destination node can be obtained in the dynamic network. In Fig. 3, nodes E and F are one-hop neighbor nodes of destination node G. According to Eq. (20), the reward values from nodes E and F to destination node G can be represented as QE(G, G) and QF(G, G), respectively. Considering the effects of link quality and bandwidth, it can be determined that the final Q-values of QE(G, G) and QF(G, G) are 0.7 and 0.8, respectively. The neighbor nodes of D are A, B, C, E, F, and H. When D receives the beacon packets sent from any neighbor, the data packets are parsed, and the maximum Q-value to the destination node G is extracted, taking F, \( \underset{y\in {N}_F}{\max }{Q}_F\left(G,Y\right) \). According to Eq. (20), calculate the corresponding Q-value, i.e., QD(G, F), and update the Q-table. Obviously, QF(G, G) is the largest, and its value is 1, which will be recorded in the beacon packet. A similar procedure will be performed with data packets from other nodes; then one particular column in the Q-table is updated. Considering bandwidth and link reliability and assuming that QD(G, F) = 0.5 according to Eq. (20), the Q-tables of the other neighbor nodes are updated, as shown in Fig. 4. As neighbor data packets are continually received, node D constantly modifies the maximum Q-value between node D and its neighbor nodes in the Q-table. Similarly, when node D sends a beacon packet, it traverses a particular column in the Q-table, finds the maximum Q-value among node D and its neighbor nodes, and records and sends the Q-value. When node A receives the beacon packet sent from node D, it forwards the maximum Q-value and carries on the computation to update QA(G, D). The same procedure will be performed when it receives beacon packets from other nodes. Through constant exchange of data packets, the results shown in Fig. 4 will finally be obtained.

Fig. 4
figure 4

Q-values to the destination node G as saved by each node

According to Fig. 4, the optimal path from A to G is easily found, i.e., the path with the biggest Q-value of its nodes is the optimal path. As Fig. 4 shows, A → B → E → G is the path with the maximum Q-value and therefore is the optimal path. The dynamic update-and-save of the Q-table makes the algorithm respond to dynamic changes in topology quickly and with good reliability as well as robustness.

4.2 Description of the RSAR algorithm

This subsection describes the basic process of data packet forwarding in the RSAR algorithm and then provides a detailed description of each stage.

4.2.1 Basic transmission process

When the RSAR algorithm routes data packets, it performs the following three steps:

  1. 1.

    When a source node sends a data packet, it checks its own Q-table to see whether its first entry is the next hop node to the destination. If so, then it selects the neighbor node with the largest Q-value; if not, then it starts the route development process, which is described in subsection 4.2.2.

  2. 2.

    After the route has been developed, a basic path from the source node to the destination node has been obtained, and the learning of some vehicle nodes is complete. To find the optimal path of the whole network topology and solve the network segmentation problem, the route maintenance process is started to maintain the end-to-end path dynamically. The route maintenance process is described in subsection 4.2.3.

  3. 3.

    Through the processes described above, the optimal path of the entire network topology is established. When a vehicle node receives/sends data packets, it implements the first step; otherwise, it will implement the second step.

4.2.2 Route development process

At the beginning, when the source node sends a data packet to the destination node, it checks the Q-table to see whether there is a next-hop node to the destination node. It then finds a neighbor node with the maximum Q-value to the destination node and forwards the data packet to it. If there is no such node, it starts the route discovery process. The source node sends an R _ req data packet by broadcasting to the entire network and starts a path request timer, where R _ req records the IDs of all the nodes that it passed through in the routing process. When the destination node receives the first R _ req packet from the source node, it saves the packet, and subsequently received packets will be discarded. From the R _ req data packet, the destination node extracts the ID information that it recorded and flips it, then generates an R _ rep data packet and writes the flipped path information into R _ rep. After waiting for a time slot, it broadcasts the R _ rep data packet to the nodes recorded in the flipped path. When a recorded node receives the data packet, it modifies the next-hop address, updates its Q-table, and sends out the packet through single-hop broadcast. Other non-destination nodes only modify their Q-tables and drop the packet when they receive it. Until R _ rep is sent to the destination node, i.e., the source node of R _ req, the source node cancels the request timer while updating its Q-table. At this time, a path has been found from the source node to the destination node, and the Q-tables of the nodes on the first path from source node to destination node are updated for the first time; their initial values in the Q-table are 0.

4.2.3 Route maintenance process

When the first route path is established, the Q-table values of some nodes adjacent to the path are also updated. To ensure that the path remains effective with dynamic changes in the network, it is important to start the route maintenance process. The main purpose of the route maintenance process is to maintain the Q-tables dynamically and to solve the network segmentation problem. Each node periodically broadcasts beacon packets to update the Q-tables of neighbor nodes, where the beacon data packet is mainly composed of the position, speed, and max(Q − Value) of the node. max(Q − Value) was described along with the learning process. To ensure update effectiveness, the transmission delay of the beacon packet in the experiment was set to a random number in the interval [0.5, 1]. The effective time of the destination node is specified for the Q-table. If the time of a destination node is longer than the specified time because it has not been updated, the destination node is considered invalid, and the corresponding column of data is deleted. When a network partition emerges due to vehicle movement, RSAR uses a store-and-forward strategy and starts the path request timer to broadcast an R _ req data packet. If it does not receive the R _ rep data packet sent from the destination node until the timer has expired, the destination node is considered to be unreachable, and the source node is informed to cancel the transmission; otherwise, the routing path will be reestablished at the breakpoint.

5 Experimental tests

NS-2 was used to verify the performance of the RSAR algorithm proposed in this paper. The simulation environment (scenario and parameter setting) will first be described, followed by a detailed analysis of the experimental results.

5.1 Simulation environment settings

NS-2 was used to simulate the experiment, and VanetMobiSim was used to generate a 1500 × 1500 square topology, as shown in Fig. 5. The topology consisted of intersections and straight roads, where each road was set to two two-way lanes, traffic lights were set at three intersections, and the change time of traffic lights was 5 s. To highlight the authenticity of the simulation, in the network environment, every vehicle node used an intelligent driving model (IDM), including changing lanes, overtaking, avoiding, and waiting. The generated mobility model was added to NS-2. Table 2 shows the basic NS-2 parameter settings. Over the whole network, 10 pairs of CBR data streams were randomly generated to send packets with a size of 512 bytes, and the transmission layer used UDP. The size of each beacon packet was calculated according to the transmitted information. To verify the performance of the RSAR algorithm effectively, two situations were set up to simulate the scenario in Fig. 5. In the first situation: the number of nodes in the entire network was set to a fixed value of 80, and the maximum speed of the vehicles ranged from 30 km/h to 90 km/h. In this situation, the discussion will focus on the effect of vehicle speed on the routing protocol. The second situation can be described as follows: the maximum speed of the vehicles was set to a constant 40 km/h, and the number of nodes varied from 60 to 120. In this situation, the discussion will focus on the effect of vehicle density on the routing protocol. At the beginning of the experiment, the nodes were randomly dispersed on different roads and moved on a fixed route. Each simulation time was set to 300 s, each situation was run 20 times, the average value was taken, and the simulation results were compiled.

Fig. 5
figure 5

Simulation topology

5.2 Experimental results

The RSAR algorithm proposed in this paper was compared with the GPSR algorithm [17], the SLBF algorithm [11], the QLAODV algorithm [14], and the methods outlined in [22, 24, 26, 33, 37, 39]. These algorithms are all representative routing algorithms. This comparison makes it possible to evaluate the advantages and disadvantages of RSAR. In different scenarios, through comparisons of packet delivery ratio, end-to-end delay, and number of hops, the results shown in Figs. 5, 6, 7, 8, 9, 10, 11, and 12 were obtained. Figures 5, 6, and 7 show the comparisons in the first test case, and Figs. 8, 9, and 10 show the comparisons in the second one.

Fig. 6
figure 6

Packet delivery ratio versus speed

Fig. 7
figure 7

End-to-end time delay versus speed

Fig. 8
figure 8

Routing length versus speed

Fig. 9
figure 9

Packet delivery ratio versus number of nodes

Fig. 10
figure 10

End-to-end time delay versus number of nodes

Fig. 11
figure 11

Average route length versus number of nodes

Fig. 12
figure 12

NTO comparison among different methods: a the method proposed here, b [22], c [24], d [26]

Figure 6 compares the relationship between packet delivery ratio and speed. The packet delivery ratio is the ratio of the number of data packets received by the destination node to the number of data packets sent by the source node. Figure 6 shows that, with increasing speed, the RSAR proposed in this paper achieved a high and stable packet delivery ratio, with an average ratio greater than 90%. The packet delivery ratios of the other three routing algorithms examined showed a sharp downward trend with increasing speed. This improvement was achieved because RSAR fully considered the influence of speed changes on link reliability. By evaluating the links between nodes, the reliability of these links was determined, and this information was used in routing decisions as a learning parameter of the Q-Learning algorithm. The packet delivery ratio of GPSR declined most quickly due to its selection of the next hop node without considering link reliability and due to its greedy mechanism. When the speed was less than 54 km/h, the packet delivery ratio of QLAODV was also greater than 90%, but when the speed exceeded 54 km/h, the ratio dropped sharply. Although it uses the Q-Learning model, because of the need for maintenance of an end-to-end reliable path, QLAODV must repair the path continuously with increasing speed, leading to a decline in the ratio. The packet delivery ratio of SLBF also declined because it is affected by topology changes and link stability is not fully considered. Due to its use of a greedy approach, it was closer to the GPSR algorithm.

Figure 7 shows the relationship between end-to-end time delay and speed, where the time delay is calculated only as the average time taken by the destination node to receive valid data packets. Figure 7 shows that with increasing speed of the vehicle nodes, the time delays of the four algorithms all presented a rising trend. The time delay of the RSAR algorithm proposed in this paper was between those of GSPR and SLBF and was relatively stable. When the maximum speed exceeded 60 km/h, time delay of SLBF increased rapidly and became greater than that of RSAR. This trend was caused by the influence of topology changes and the need for recalculation of the effective forwarding area or re-transmission mechanism. Although RSAR was less affected by topology changes, the path with maximum Q-value is bound to have the largest bandwidth, the most reliable link, and the shortest routing length in this algorithm. The time delay of GPSR was the shortest because it only uses the greedy forwarding mode. The packet delivery ratio of GPSR had the lowest value because it only uses the greedy mechanism, so that it immediately drops packets when they fail to arrive. QLAODV also uses the Q-Learning model, but increasing speed made it switch paths frequently to maintain the effective routing path. This meant that the time delay of packet delivery was greatly increased. From Fig. 7, the packet delivery ratio of RSAR was between 0.1 and 0.2 under the rapidly changing topology environment. From Fig. 6, its packet delivery ratio was greater than 90%. It is necessary for these applications to transfer large amounts of data.

Figure 8 shows the relationship between average route length and speed, where the route length was calculated by the average number of hops taken for valid data packets to reach the destination. Figure 8 shows that the route length in RSAR was less than that of QLAODV because it does not use the path transformation mechanism to maintain the whole path, and the forwarding decision is used only to choose the node with maximum Q-value, so that its time delay is much shorter than that of QLAODV, as shown in Fig. 7. Figure 8 also shows that the RSAR proposed in this paper has a stable route length. This can be achieved because of the store-and-forward mechanism and the maximum Q-value selection mechanism, which ensures that each path selected is the shortest one. SLBF and GPSR both use a greedy approach to forward packets, but when the speed exceeds 60 km/h, under the influence of topology changes, the route length of SLBF increases; although GPSR does not use a reliable mechanism, the number of hops reached the minimum value.

Figure 9 shows the relationship between the packet delivery ratio and the number of nodes. As the number of nodes increases, the packet delivery ratio in all four algorithms shows a rising trend. When the number of nodes is 90, the packet delivery ratio of RSAR is greater than 90%, whereas that of QLAODV is 85%. Although these two algorithms are both based on the idea of Q-Learning, RSAR fully considers link stability between nodes so that the packet delivery ratio is greatly increased. Due to the store-and-forward mechanism, the packet delivery ratio of RSAR is much higher than that of SLBF and GPSR as the number of nodes increases. Because SLBF uses the re-transmission mechanism, its packet delivery ratio is greater than that of GPSR; but because it is greatly affected by topology and the vehicle and the sending node are usually not on the same path, its packet delivery ratio is seriously affected and becomes lower than that of RSAR.

Figure 10 shows the relationship between end-to-end time delay and the number of nodes. The packet delivery ratio of RSAR gradually draws close to that of GPSR. The reason for this is that as the number of nodes increases, the number of learning nodes in RSAR also increases, so that the path to the destination node is shorter and the time delay is less. As the number of nodes increases, the time delay of QLADOV gradually draws close to that of RSAR, but with fewer nodes, it is much larger than that of RSAR. This behavior occurs because QLAODV spends much time on route maintenance. As the number of nodes increases, SLBF decreases slowly because it uses a timed broadcast mechanism that increases time delay. The time delay of all four algorithms shows a downward trend as the number of nodes increases.

Figure 11 shows the relationship between the average route length and the number of nodes. As the number of nodes increases, the average route length of all four algorithms shows a downward trend. This occurs mainly because the number of effective forwarding nodes increases. The length of the path found by RSAR is shorter than that of QLAODV. The path lengths of RSAR, SLBF, and GPSR are close because SLBF and GPSR both use a greedy approach. As the number of nodes increases, the next hop selected by RSAR is closer to the farthest node.

To show comprehensive comparative results for recent algorithms, extensive simulations and experiments have been conducted in this study to examine the performance of the proposed method. The Normalized Transmission Overhead (NTO) was defined to analyze the complexity of the proposed method and its cost. The metrics for the degree of complexity of the proposed method include context gathering overhead (Ocg) for VANET, transmission overhead (Ot) of data packets among VANET nodes, and others. According to classical computational complexity analysis methods [29,30,31], from the equations previously given in this paper, the degree of computational complexity of the proposed method can be determined as O(n):

$$ NTO=\sum \limits_{i=1}^{MC}\sum \limits_{j=1}^{MC}\left(\xi {O}_{\left(i,j\right)}^{cg}+\varpi {O}_{\left(i,j\right)}^t+\left(1-\xi -\varpi \right){C}_{\left(i,j\right)}^{de}\right), $$
(21)

where MC is the number of vehicle nodes in each evaluation group in VANET, such as MC = 10, 20, 30, 40, 50, or 60, which means that there is a related number of vehicle nodes in a certain evaluation group in VANET. The integer parameters i and j are the node indices of each evaluation group in VANET. Parameters ξ, ϖ are real weighting values, ξ, ϖ∈ [0, 1] and ξ + ϖ ≤ 1, such that the default real values are ξ = 0.6, ϖ = 0.4.\( {O}_{\left(i,j\right)}^{cg},{O}_{\left(i,j\right)}^t,{C}_{\left(i,j\right)}^{de} \) are the normalized real overhead values of vehicle nodes i and j for each evaluation group in VANET.

Based on Eq. (21), from the relative results comparison of existing methods and the method proposed here as shown in Fig. 12 ((a) the proposed method, (b) [22], (c) [24], (d) [26]), it is evident that the method proposed here has better performance than those proposed in Refs. [22, 24, 26]. This guarantees high QoS (such as lower overhead, less complexity, shorter delays, and transmission reliability) of VANET. Therefore, according to the above results, the method proposed here has certainly improved on routing overhead, transmission delays, rate of packet delivery, rate of losing packets, throughput, and other performance metrics for VANET.

In addition, many relative comparisons have been performed with the methods proposed in [33, 37, 39]. These relative comparison figures were not reported in this paper because the effects were similar to those shown in Fig. 12.

6 Conclusions

The routing service algorithm is a very important part of VANET. An efficient routing algorithm can greatly improve the data transmission rate, so that many applications can be run in VANET. To overcome the problems of dramatic topology changes and unreliable links caused by fast vehicle movements in VANET, a reliable adaptive routing service algorithm is proposed in this paper. First, the properties of vehicle motion and the factors that make links unreliable were studied, and the duration of a link was proved to have a log-normal distribution. Second, on the basis of this, a link reliability calculation model was developed. The link reliability between nodes was evaluated as a parameter and applied in the Q-Learning algorithm, following which the RSAR algorithm was proposed. Finally, simulation experiments were conducted to evaluate the performance of four routing algorithms, including packet delivery ratio, end-to-end time delay, and number of hops. The results showed that RSAR had a higher packet delivery ratio under various conditions and a low transmission time delay. RSAR can effectively solve the problems caused by changes in topology through self-learning. However, because the learning process is a kind of local diffusion and there can be very many nodes participating in selecting the route, the routing expense in a large network environment will be enormous. Hence, future work by the authors will be focused on the problem of routing expense.