
1 Introduction

Unmanned aerial vehicle (UAV) path planning refers to finding a path between a starting position and a destination under constraints such as terrain, radar, and other factors [1]. In recent decades, many methods have been proposed to solve UAV path planning in 2D environments, including graph-based algorithms [2, 3], the Rapidly-exploring Random Tree (RRT) algorithm [4], the artificial potential field (APF) algorithm [5, 6], and reinforcement learning [7]. Population-based algorithms are the methods most commonly used to plan paths for UAVs, such as the genetic algorithm (GA) [8, 9], particle swarm optimization (PSO) [10], and the ant colony algorithm (ACO) [11, 12]. These algorithms offer fast convergence, good parallelism, and easy collaboration among multiple populations.

Compared to other algorithms, PSO is often utilized in UAV path planning. However, conventional PSO faces several challenges. First, the particle initialization process is complex, leading to suboptimal initial paths. Second, the algorithm converges slowly and is prone to getting trapped in local optima. To address these problems, we propose EPSO-GA and apply it to UAV path planning. The main contributions of this paper are summarized as follows.

(1) A hybrid approach of Q-learning and random initial solutions is applied to find the initial paths for the UAV, which improves the quality of initial paths and accelerates the convergence of the EPSO-GA.

(2) The acceleration coefficients of EPSO-GA are made adaptive through the fitness value to make full use of all particles and strengthen the global search ability of the algorithm. In addition, a heuristic factor is introduced into the mutation operation to speed up the convergence of the algorithm.

2 The Proposed Method

In this section, we construct the objective function for UAV path planning. To enhance the quality of the initial population in the particle swarm algorithm, we introduce Q-learning for hybrid initialization. Furthermore, we accelerate the convergence speed and improve the global search capability of the algorithm by incorporating mutation and crossover mechanisms.

2.1 Objective Function Construction

In this article, path length is defined as follows:

$$\begin{aligned} \begin{aligned} f_{L}=\sum _{i=1}^{n} \sqrt{\left( x_{i+1}-x_{i}\right) ^{2}+\left( y_{i+1}-y_{i}\right) ^{2}} \end{aligned} \end{aligned}$$
(1)

where \(x_{i}\) and \(y_{i}\) are the coordinates of path point i, and n is the number of path segments (the path contains n+1 points).

To simplify the calculation, the path from the starting point to the destination is divided into three segments, and the 1/4, 2/4 and 3/4 points of each segment are used as reference points. The threat intensity of the jth radar against the UAV is calculated as follows:

$$\begin{aligned} \begin{aligned} T_j=\sum _{m=1}^3\,K\left( \frac{1}{d_{1 / 4, m, j}^4}+\frac{1}{d_{2 / 4, m, j}^4}+\frac{1}{d_{3 / 4, m, j}^4}\right) \end{aligned} \end{aligned}$$
(2)

where m indexes the path segment, K is the threat intensity coefficient, \(d_{1 / 4, m, j}\) denotes the distance between the jth radar and the 1/4 point of the mth path segment (and similarly for \(d_{2 / 4, m, j}\) and \(d_{3 / 4, m, j}\)), and \(T_j\) denotes the threat intensity of the jth radar against the UAV. The total threat intensity experienced by a single UAV is therefore:

$$\begin{aligned} \begin{aligned} f_T=\sum _{j=1}^N T_j \end{aligned} \end{aligned}$$
(3)

where N is the number of radars.

To reduce computational complexity, the UAV path planning problem is formulated as a constrained multi-objective function. By assigning a weight to each objective, the multi-objective problem is transformed into a single-objective one. The objective function can be represented as follows:

$$\begin{aligned} \begin{aligned} f_{obj}=w_1 * f_L+w_2 * f_T \end{aligned} \end{aligned}$$
(4)

where \(w_1\) and \(w_2\) are the weights for path length and threat intensity respectively.
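For concreteness, the following Python sketch evaluates Eqs. (1)-(4) for a polyline path. The function and variable names, the equal-thirds split of the waypoint list, and the small distance guard are our own illustrative choices, not the paper's implementation.

```python
import math

def path_length(path):
    """Eq. (1): sum of Euclidean segment lengths over consecutive waypoints."""
    return sum(math.dist(path[i], path[i + 1]) for i in range(len(path) - 1))

def threat_intensity(path, radars, K=1.0):
    """Eqs. (2)-(3): 1/d^4 terms at the 1/4, 2/4, 3/4 points of each of
    three path segments, summed over all radars."""
    total = 0.0
    n = len(path) - 1
    # Assumed interpretation: split the waypoint list into three legs.
    legs = [path[: n // 3 + 1], path[n // 3 : 2 * n // 3 + 1], path[2 * n // 3 :]]
    for radar in radars:
        t_j = 0.0
        for leg in legs:
            (x0, y0), (x1, y1) = leg[0], leg[-1]
            for frac in (0.25, 0.5, 0.75):
                qx, qy = x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)
                d = max(math.dist((qx, qy), radar), 1e-9)  # guard against d = 0
                t_j += 1.0 / d ** 4
        total += K * t_j
    return total

def objective(path, radars, w1=0.8, w2=0.2):
    """Eq. (4): weighted single-objective combination (weights from Sect. 3.1)."""
    return w1 * path_length(path) + w2 * threat_intensity(path, radars)
```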

2.2 Initialization Method of PSO Particle Swarm Based on Q-Learning

In this paper, a hybrid of Q-learning and random initialization is used to obtain the initial swarm of particles. To ensure that the paths generated by Q-learning have lower threat intensity and shorter length than randomly generated ones, the reward function of Q-learning is designed as shown in Table 1.

Table 1. Design of reward function.

In Q-learning, feasible points, obstacles and radars are distinguished by positive and negative values. Each cell carries a value: 0 for the starting point, r for free space, −1 for a cell occupied by an obstacle or radar, and 20 for the destination. The reward values of feasible points (r > 0) are adjusted adaptively according to their positions: the reward increases as a point approaches the destination and decreases as it approaches a threat. The reward value of a feasible point is given by:

$$\begin{aligned} \begin{aligned} r=r_0+r_{\max } * \exp \left[ \frac{-\sqrt{\left( x_u-x_{\textrm{g}}\right) ^2+\left( y_u-y_{\textrm{g}}\right) ^2}}{k_{\textrm{g}}}\right] - r_{\textrm{t}} * \exp \left[ \frac{-\sum _{i=1}^m \sqrt{\left( x_u-x_{\textrm{it}}\right) ^2+\left( y_u-y_{\textrm{it}}\right) ^2}}{k_{\textrm{t}}}\right] \end{aligned} \end{aligned}$$
(5)

where \(r_0\) represents the basic reward value, \(r_{\max }\) represents the maximum reward value without considering threats, \(r_{t}\) is the threat coefficient, \(\left( x_{u}, y_{u}\right) \) and \(\left( x_{g}, y_{g}\right) \) are the coordinates of the UAV and the target point respectively, \(\left( x_{it}, y_{it}\right) \) is the coordinates of the ith radar, \(k_{g}\) and \(k_{t}\) are the control coefficients of the reward function and m is the number of threats.
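A minimal sketch of Eq. (5) follows, assuming the operator between the two exponential terms is a minus sign (consistent with the stated behaviour that rewards drop near threats). The constant values are illustrative placeholders, not the paper's settings.

```python
import math

def reward(cell, goal, threats, r0=1.0, r_max=10.0, r_t=5.0, k_g=10.0, k_t=10.0):
    """Eq. (5): reward of a free cell grows near the goal, shrinks near threats."""
    d_goal = math.dist(cell, goal)
    d_threats = sum(math.dist(cell, th) for th in threats)  # summed threat distances
    return r0 + r_max * math.exp(-d_goal / k_g) - r_t * math.exp(-d_threats / k_t)
```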

The balance between exploration and exploitation is another key aspect of Q-learning. This paper adopts the Boltzmann distribution [13], expressed as follows:

$$\begin{aligned} \begin{aligned} p(a \mid s)=\frac{e^{Q(s, a) / T}}{\sum _{a_{i} \in A} e^{Q\left( s, a_{i}\right) / T}} \end{aligned} \end{aligned}$$
(6)
$$\begin{aligned} \begin{aligned} T=\lambda ^{k} T_{0} \end{aligned} \end{aligned}$$
(7)

where \(p(a \mid s)\) represents the probability that action a is selected in state s. \(\lambda \) is a constant satisfying \(0< \lambda < 1\), k is the current iteration number and T is a control parameter. At the beginning of the training process, T has a large value to ensure strong exploration capability, which decreases as the number of iterations increases to ensure that the algorithm focuses on exploitation. The update formula for Q values is as follows:
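The selection rule of Eqs. (6)-(7) can be sketched as a temperature-annealed softmax; `lam` and `T0` below are assumed illustrative values, not the paper's settings.

```python
import math
import random

def boltzmann_action(q_values, k, T0=10.0, lam=0.95):
    """Eqs. (6)-(7): pick an action with probability ~ exp(Q/T), T = lam^k * T0."""
    T = (lam ** k) * T0                       # temperature decays with iteration k
    weights = [math.exp(q / T) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]
```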

$$\begin{aligned} \begin{aligned} Q(s, a) \leftarrow Q(s, a)+\alpha \left[ r+\gamma \max Q\left( s^{\prime }, a^{\prime }\right) -Q(s, a)\right] \end{aligned} \end{aligned}$$
(8)

where \(\alpha \) is the learning rate, \(\gamma \) is the discount factor, r is the immediate reward value of the current action and \(Q\left( s,a \right) \) is the estimated value of taking the current action.
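The update of Eq. (8) is a one-line temporal-difference step. Here `Q` is assumed to be a tabular mapping from state to a list of action values, and `alpha` and `gamma` are typical illustrative settings.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Eq. (8): move Q(s, a) toward the bootstrapped target r + gamma * max Q(s')."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
```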

Using the Q-learning method described above, m paths are generated, and n paths are obtained by random initialization; the initial population size is \(S=m+n\).

2.3 PSO Mutation and Crossover Strategy

Inspired by the way of multiple UAV task assignment in [14], the updating strategy of EPSO-GA can be defined as follows:

$$\begin{aligned} \begin{aligned} x_{i, j}(t+1)=c_{2} \bullet f_{3}\left( c_{1} \bullet f_{2}\left( w \bullet f_{1}\left( x_{i, j}(t)\right) , p_{i, j, \text{ best } }(t)\right) , g_{i, j, \text{ best } }(t)\right) \end{aligned} \end{aligned}$$
(9)

where \(x_{i, j}(t)\) and \( p_{i, j, \text{ best } }(t)\) denote the position and personal best value of particle i at iteration t, while \(g_{i, j, \text{ best } }(t)\) denotes the global best value at the t-th generation. w is the learning factor of the particle with respect to itself, and \(c_{1}\) and \(c_{2}\) are the learning factors of the current particle with respect to its personal best value and the global best particle, respectively. \(f_{1}\) is the operation of the particle on itself, while \(f_{2}\) and \(f_{3}\) are the operations based on the personal best value and the global best value, respectively.
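Our structural reading of Eq. (9) is that w, \(c_1\) and \(c_2\) act as probabilities gating one mutation and two crossovers. The sketch below makes this explicit; `mutate` and `crossover` are placeholders elaborated in the sketches that follow.

```python
import random

def update_particle(x, p_best, g_best, w, c1, c2, mutate, crossover):
    """Eq. (9), read as three probability-gated operators applied in sequence."""
    if random.random() < w:    # f1: self mutation with probability w
        x = mutate(x)
    if random.random() < c1:   # f2: crossover with the personal best
        x = crossover(x, p_best)
    if random.random() < c2:   # f3: crossover with the global best
        x = crossover(x, g_best)
    return x
```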

\(f_1\) is defined as a mutation in which the mutation probability of particle i is w; \(f_1\) is formulated as follows:

$$\begin{aligned} \begin{aligned} f_{1}=w \bullet f_{1}\left( x_{i, j}(t)\right) \end{aligned} \end{aligned}$$
(10)

where w is the inertia weight which has a great influence on performance of the algorithm. In order to accelerate the convergence of the algorithm, we adopt the linear time-varying inertia weight updating strategy in [15], which is expressed in the following formula:

$$\begin{aligned} \begin{aligned} w(t)=\frac{T-t}{T}\left( w_{\max }-w_{\min }\right) +w_{\min } \end{aligned} \end{aligned}$$
(11)

where the value of w decreases linearly from \(w_{\max }\) to the final value \(w_{\min }\), t is the current iteration of the algorithm, and T is the maximum number of iterations. At the beginning of the iterations the particles exhibit strong global search ability, while towards the end they favour local search.
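Eq. (11) amounts to a one-line schedule; the default bounds below follow the settings in Sect. 3.1.

```python
def inertia(t, T, w_max=0.8, w_min=0.2):
    """Eq. (11): inertia weight decays linearly from w_max to w_min over T iterations."""
    return (T - t) / T * (w_max - w_min) + w_min
```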

Fig. 1. Crossover and mutation mechanisms.

As shown in Fig. 1(a), the mutation operation begins by randomly selecting a point in the path (excluding the start and end points). Given the preceding coordinate N, there are 8 possible choices for the mutated coordinate. To improve search efficiency, the distance \(D_i\) between each candidate i and the destination is calculated, and the probability \(P_i\) of candidate i being selected is:

$$\begin{aligned} \begin{aligned} P_{i}=k_{i} * \frac{1 / D_{i}}{\sum _{i=1}^{8} 1 / D_{i}} \end{aligned} \end{aligned}$$
(12)

where \(k_i\) is adjusted adaptively according to the angle between candidate i and the destination. In particular, the probability is 0 when candidate i lies on an obstacle.
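A sketch of the mutation of Fig. 1(a) and Eq. (12) on a grid path follows. The angle factor \(k_i\) is simplified to 1, and the obstacle representation (a set of grid cells) is our assumption.

```python
import math
import random

NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def mutate(path, goal, obstacles):
    """Replace one interior waypoint by a neighbour of the preceding
    coordinate N, weighted by inverse distance to the destination (Eq. (12))."""
    idx = random.randrange(1, len(path) - 1)            # skip start and end points
    prev = path[idx - 1]                                # coordinate "N" in Fig. 1(a)
    candidates = [(prev[0] + dx, prev[1] + dy) for dx, dy in NEIGHBOURS]
    weights = [0.0 if c in obstacles                    # obstacle candidates get P = 0
               else 1.0 / max(math.dist(c, goal), 1e-9)
               for c in candidates]
    if sum(weights) == 0:
        return path                                     # all neighbours blocked
    new_point = random.choices(candidates, weights=weights)[0]
    return path[:idx] + [new_point] + path[idx + 1:]
```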

To address the coordinate transformation issue, this paper introduces the crossover mechanism from genetic algorithms while retaining the individual and global learning strategies of the particle swarm.

\(f_2\) is defined as a crossover between particle \(x_{i,j}\) and the personal best particle with probability \(c_{1}\). The formula for \(f_2\) is as follows:

$$\begin{aligned} \begin{aligned} f_{2}=c_{1} \bullet f_{2}\left( f_{1}, p_{i, j, \text{ best } }(t)\right) \end{aligned} \end{aligned}$$
(13)

\(f_3\) is defined as a crossover between particle \(x_{i,j}\) and the global best particle with probability \(c_{2}\). The formula for \(f_3\) is as follows:

$$\begin{aligned} \begin{aligned} f_{3}=c_{2} \bullet f_{3}\left( f_{2}, g_{i, j, b e s t}(t)\right) \end{aligned} \end{aligned}$$
(14)

If the path points of the current particle overlap with those of the personal best particle or the global best particle (excluding the start and end points), a crossover operation is performed. In Fig. 1(b), particle x shares two crossover points (63, 92) with the personal best particle: the path point 71 lying between the two crossover points is swapped, and the path points (112, 131) lying between the second crossover point and the target point are also swapped between the personal best particle and particle x. In Fig. 1(d), particle x shares more than two crossover points (63, 92, 131) with the personal best particle; in this case, only the path points 71 and 112 lying between the crossover points are swapped.
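A sketch of the crossover of Fig. 1(b)/(d): the sub-path between the first two shared interior waypoints is exchanged. This is our simplified reading of the mechanism, returning a single offspring rather than modifying both paths.

```python
def crossover(path_a, path_b):
    """Exchange the segment between the first two waypoints shared by both paths."""
    shared = [p for p in path_a[1:-1] if p in path_b[1:-1]]
    if len(shared) < 2:
        return path_a                          # need at least two crossover points
    i1, i2 = path_a.index(shared[0]), path_a.index(shared[1])
    j1, j2 = path_b.index(shared[0]), path_b.index(shared[1])
    # Keep path_a outside the shared points, take path_b's segment between them.
    return path_a[:i1 + 1] + path_b[j1 + 1:j2] + path_a[i2:]
```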

In EPSO-GA, the values of \(c_{1}\) and \(c_{2}\) are updated based on their fitness values to fully utilize all particles and enhance the algorithm’s ability to escape local optima. In each iteration, all particles are sorted based on their fitness values, with the top half being stored in set A and the bottom half in set B. In the next iteration, particles in set A have a higher probability of crossover. The formulas for updating the learning factors \(c_{1}\) and \(c_{2}\) are as follows:

$$\begin{aligned} \begin{aligned} \eta =\frac{min _{-} \text{ fitness } }{max _{-} \text{ fitness } }-\frac{min _{-} \text{ fitness } }{ \text{ fitness } \left( x_{i j}\right) } \end{aligned} \end{aligned}$$
(15)
$$\begin{aligned} \begin{aligned} c_{1}=a+(1-a) e^{\eta } \end{aligned} \end{aligned}$$
(16)
$$\begin{aligned} \begin{aligned} c_{2}=b+(1-b) e^{\eta } \end{aligned} \end{aligned}$$
(17)

In the formulas above, \(max _{-} \text{ fitness } \) and \(min _{-} \text{ fitness } \) represent the maximum and minimum fitness values of all particles in the population, respectively. \( \text{ fitness } \left( x_{i j}\right) \) represents the fitness value of particle j after i iterations, and the values of a and b are experimentally determined constants. For particles in set A, higher fitness values result in higher crossover probabilities. The particle with the highest fitness value will always undergo crossover, while the probabilities of the other particles decrease exponentially with their fitness values, but still satisfy \(c_{1}>a\), \(c_{2}>b\). Through these operations, the top half of the particles with higher fitness values are given an increased probability of crossover, allowing them to focus more on searching, improving the algorithm’s global search capability, and helping to escape local optima.

For the bottom half of the particles with lower fitness values, the values of \(c_{1}\) and \(c_{2}\) are designed as follows:

$$\begin{aligned} \begin{aligned} c_{1}=c_{\max }-\frac{c_{\max }-c_{\min }}{max _{-} iter} * t \end{aligned} \end{aligned}$$
(18)
$$\begin{aligned} \begin{aligned} c_{2}=c_{\min }+\frac{c_{\max }-c_{\min }}{max _{-} iter} * t \end{aligned} \end{aligned}$$
(19)

Here, \(c_{\max }\) and \(c_{\min }\) are chosen such that \(\left( c_{\max }+c_{\min }\right) / 2=c\), where c is a constant and \(1>c_{\max }>c_{\min }\). \(max _{-} iter\) represents the maximum number of iterations, and t represents the current iteration number. As the number of iterations increases, the value of \(c_{1}\) gradually decreases while the value of \(c_{2}\) gradually increases, causing the algorithm to shift from global search to local search.
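The two coefficient rules can be combined into one helper, as sketched below. The default constants follow the settings in Sect. 3.1, and the top/bottom-half flag is assumed to come from the fitness-sorted population. Note that for the particle with fitness equal to \(max_{-}fitness\), \(\eta =0\) and hence \(c_1=c_2=1\), so it always undergoes crossover, matching the text above.

```python
import math

def coefficients(fitness, min_fit, max_fit, in_top_half, t, max_iter,
                 a=0.8, b=0.8, c_max=0.8, c_min=0.6):
    """Adaptive c1, c2: Eqs. (15)-(17) for set A, Eqs. (18)-(19) for set B."""
    if in_top_half:
        eta = min_fit / max_fit - min_fit / fitness     # Eq. (15)
        c1 = a + (1 - a) * math.exp(eta)                # Eq. (16)
        c2 = b + (1 - b) * math.exp(eta)                # Eq. (17)
    else:
        c1 = c_max - (c_max - c_min) / max_iter * t     # Eq. (18): decreasing
        c2 = c_min + (c_max - c_min) / max_iter * t     # Eq. (19): increasing
    return c1, c2
```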

Additionally, this article simplifies paths: when the straight line between two non-adjacent waypoints is blocked by no obstacle or threat point, the intermediate waypoints are removed. As shown in Fig. 1(c), if the UAV can fly in a straight line from the current point to a later waypoint, the path A-C-B is replaced by the simpler path A-B.
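A sketch of the simplification of Fig. 1(c) as a greedy line-of-sight pass; the sampled collision test against a set of grid cells is our simplification of the paper's check.

```python
import math

def segment_clear(p, q, obstacles, step=0.5):
    """Sample the segment p-q and reject it if any sample lands on an obstacle cell."""
    n = max(int(math.dist(p, q) / step), 1)
    for i in range(n + 1):
        t = i / n
        x, y = p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])
        if (round(x), round(y)) in obstacles:
            return False
    return True

def simplify(path, obstacles):
    """From each kept waypoint, jump to the furthest waypoint visible in a straight line."""
    i, out = 0, [path[0]]
    while i < len(path) - 1:
        j = len(path) - 1
        while j > i + 1 and not segment_clear(path[i], path[j], obstacles):
            j -= 1
        out.append(path[j])
        i = j
    return out
```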

3 Simulation

To demonstrate the performance of the proposed method, simulations and comparisons are carried out in 2D static environments. All methods are implemented in MATLAB and compared with CIPSO [16] and CIGA [17]; the simulations run on a platform with a 3.2 GHz CPU and 8.0 GB of RAM.

3.1 Parameters Setting

The weights of the objectives are set as \(w_1 = 0.8\) and \(w_2 = 0.2\), the population size as S = 50, and \(max\_iter\) = 100. The parameters of the EPSO-GA algorithm are set as \(w_{max}=0.8\), \(w_{min}=0.2\), \(c_{max}=a=b=0.8\), \(c_{min} =0.6\). The parameters of the CIPSO algorithm are set as \(w_{max}=0.9\), \(w_{min}=0.4\), \(c_{max}=3.5\), \(c_{min} =0.5\), \(V_1 = 0.5\), \(V_2 = 0.1\), \(a=2\), \(\mu =4\). The parameters of the CIGA algorithm are set as \(c_r=0.8\), \(pc=0.15\). The coordinates of the starting position, the destination and the radars are listed in Table 2.

Table 2. The starting position and the destination of UAV and the coordinates of the radars.

3.2 Result of UAV Path Planning

To examine how the proportion of paths obtained by Q-learning affects the optimization, the number of Q-learning paths m and the number of randomly initialized paths n were varied, and five groups of experiments were conducted in the complex environment, each run independently 20 times. The results are shown in Table 3, where m = 0 and n = 50 means purely random initialization. The initial best fitness decreases as the proportion of m increases, and the algorithm converges best when m = 5 and n = 45; however, the average number of iterations increases with m for \(m > 5\). Therefore, compared with random initialization, introducing Q-learning paths reduces the initial best fitness of the population and accelerates the convergence of the algorithm, but if the proportion of m is too large, the algorithm may fall into local optima and require more iterations.

Table 3. Results of EPSO-GA under different m and n values.
Fig. 2. Simulation results under Z-type environment.

Fig. 3. Simulation results under complex environment.

Comparative simulations among the algorithms are carried out in two environments, a Z-type and a complex environment; each method is repeated 50 times independently and the best results are reported. The generated paths and corresponding convergence curves are presented in Fig. 2 and Fig. 3, and the statistical results are summarized in Table 4. As observed in Fig. 2(a) and Fig. 3(a), CIGA, CIPSO and the proposed method all complete the mission from the starting position to the destination without colliding with obstacles or radars. In the Z-type environment, CIGA produces the longest path, which also passes closest to the radars, with a path length of 42.1637 and a threat intensity of 17.0397, while CIPSO yields the shortest path (39.2854) but the second highest threat intensity (16.6979). Our method is slightly longer than CIPSO, with a path length of 39.7274, but keeps a larger distance from the radars, with a threat intensity of only 9.5938. In the complex environment, our method completes the mission with both the shortest path length and the lowest threat intensity.

In Fig. 2(b) and Fig. 3(b), the best fitness of EPSO-GA is clearly smaller than that of the other algorithms at the initial stage, a result of the hybrid initialization. Moreover, EPSO-GA converges fastest among the compared methods, reaching its optimum and remaining stable by the 12th and 17th iterations in the two environments, respectively.

Table 4. Results comparison between the algorithms under different environments.

A successful search is defined as finding the optimal solution within 200 iterations. As shown in Table 4, the proposed EPSO-GA ranks first in success rate. In terms of threat intensity, EPSO-GA is 43.7% and 42.5% better than CIGA and CIPSO in the Z-type environment, and 15.8% and 51.6% better in the complex environment, respectively. Although the proposed method consumes slightly more time than CIGA, owing to the adaptive parameter adjustment and hybrid initialization, it yields solutions of higher quality. The proposed EPSO-GA can effectively generate optimal paths and is well suited to off-line path planning.

Table 5. Results comparison based on different weights under complex environment.

To compare the influence of the path length and threat intensity weights on the optimization process, four groups of experiments are conducted in the complex environment by adjusting the weights of the two sub-objectives, each run independently 20 times. The results are shown in Table 5: with \(w_1\) = 0.8 and \(w_2\) = 0.2, EPSO-GA achieves the lowest average number of iterations, the minimum best and average fitness, and the best path quality.

4 Conclusion

In this paper, a hybrid EPSO-GA algorithm is proposed for UAV path planning in 2D static environments. Q-learning is utilized to initialize the paths of the UAV, which improves the quality of the initial paths and accelerates the convergence of EPSO-GA. The acceleration coefficients of EPSO-GA are made adaptive through the fitness value to make full use of all particles and strengthen the global search ability of the algorithm, and a heuristic factor is introduced into the mutation to speed up convergence. Experiments in two different environments show that the proposed algorithm offers path safety and fast convergence in single-UAV path planning. In future work, we will concentrate on reducing the time consumption of UAV path planning and extend the method to 3D environments with dynamic obstacles.