
1 Introduction

UAV cluster operation is a complex multitask system that is affected and constrained by the UAVs’ functional performance, load, information context, support equipment and other factors. Dynamic regulation of UAV cluster operation involves strategy training and optimization across the stages of assembly at takeoff, formation in flight, formation transformation, and the dispersion and recovery of multiple UAVs of different types and structures [1,2,3,4,5].

Traditional training approaches generally include linear programming, dynamic programming, the branch-and-bound method, the elimination method and other conventional optimization approaches frequently adopted in operations research. When applying these approaches, the problem is often simplified for the sake of mathematical description and modeling so as to formulate an optimized regulation scheme [6,7,8,9]. A UAV’s specifications, load, ammunition, coverage of support resources and battlefield context are all critical factors that influence UAV cluster operation. Coupled with the randomness and dynamics of operation and the number of re-regulation tasks, UAV cluster operation is an NP-hard problem: it is difficult and time-consuming to solve, and simultaneous online strategy training for a UAV cluster cannot be realized [10,11,12,13].

The paper focuses on how to realize interactive learning, acquire knowledge and improve operation strategies within a cluster operation assessment system, so as to adapt to the environment and achieve the intended purpose [14,15,16,17]. The UAV-borne computer is not informed of subsequent actions; it can only make judgments by trying each action. Reinforcement learning, notably characterized by trial-and-error search and delayed reward, reinforces optimal actions and optimal action strategies by judging the environment’s feedback on actions and guiding subsequent actions based on that assessment. In this way, UAV cluster operation can better adapt to the environment [18].

2 UAV Cluster’s Task Allocation and Modeling

Task allocation in UAV cluster operation aims at determining the UAV cluster’s targets and attack tasks and designing the paths so as to achieve maximal overall performance at the least cost in a cluster attack. This is a multi-objective optimization problem with numerous constraints. Optimization indicators include: maximal value return of targets, maximal coverage of targets, minimal flight distance and least energy consumption. Constraints include a required definition ratio of target surveillance, the necessity that each target lie within the UAV platform’s attack diameter, the limits of each no-fly/flight-avoidance zone, and a cap on the number of UAV platforms that surveil the same target [19].

Considering these multiple objectives and constraints, and based on multi-objective optimization theory, the paper sets up an overall multi-objective integer programming model for the UAV cluster’s automatic task allocation.

Let the numbers of UAV platforms and targets be \( N_{v} \) and \( N_{\text{T}} \), respectively. The decision variable is \( x_{i,j} \in \left\{ {0,1} \right\}, i \in \left\{ {1,2, \ldots ,N_{v} } \right\}, j \in \left\{ {1,2, \ldots ,N_{\text{T}} } \right\} \), where \( x_{i,j} = 1 \) means that UAV platform i attacks target j, and \( x_{i,j} = 0 \) means the opposite. On this basis, the mathematical model of task allocation in a UAV cluster attack is established.
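For concreteness, this decision variable can be encoded as a binary assignment matrix. The short sketch below (Python, with illustrative sizes; the variable names are ours, not the paper’s) shows the encoding assumed in the examples that follow.

```python
import numpy as np

# Hypothetical sizes: Nv UAV platforms, NT targets.
Nv, NT = 3, 5

# Binary decision matrix: x[i, j] = 1 means platform i attacks target j.
x = np.zeros((Nv, NT), dtype=int)
x[0, 1] = 1   # platform 0 -> target 1
x[1, 0] = 1   # platform 1 -> target 0
x[1, 3] = 1   # platform 1 -> target 3
x[2, 4] = 1   # platform 2 -> target 4

# M_i: total number of targets allocated to platform i (used by f1..f4 below).
M = x.sum(axis=1)
print(M)  # [1 2 1]
```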

Objective functions

(1) Cluster’s least flight time f1

One important indicator in the UAV cluster’s attack task allocation is minimizing the time the cluster needs to complete its tasks, i.e., making the flight path of every tasked UAV platform as short as possible.

$$ \hbox{min} \,f_{1} = \mathop {\hbox{max} }\limits_{1 \le i \le N_{v} } \mathop \sum \limits_{k = 1}^{{M_{i} }} T_{k} $$
(2.1)

Here, \( M_{i} \) is the total number of targets allocated to UAV platform i.

(2) Cluster’s total flight time f2

Another important indicator in the allocation of a collaborative attack across UAV platforms is minimizing cost, in which energy consumption is the key factor. The energy consumed in flight is related to flight distance and time: the shorter the time, the less energy is consumed. Thus, the indicator function minimizes the UAV cluster’s total flight time.

$$ \hbox{min} \,f_{2} = \mathop \sum \limits_{i = 1}^{{N_{v} }} \mathop \sum \limits_{k = 1}^{{M_{i} }} T_{k} $$
(2.2)

In order to ensure the accuracy and feasibility of the planned result when calculating fuel/battery consumption, initial path planning is indispensable, as is the calculation of energy consumption along the initially planned path.

(3) Maximization of target’s value f3

Maximizing the total value of attacked targets is also an important indicator in task allocation on the UAV platform. Under the condition of adequate attacking force, it is critical to maximize the value of the targets attacked. The indicator function is as below:

$$ \hbox{max} \,f_{3} = \mathop \sum \limits_{i = 1}^{{N_{v} }} \mathop \sum \limits_{k = 1}^{{M_{i} }} V_{\text{k}} $$
(2.3)

Here, \( V_{k} \) is the value of target k.

(4) Maximal target coverage f4

In task allocation on the UAV platform, in addition to maximizing target value, maximizing target coverage is also of great importance. The indicator function is as below:

$$ \hbox{max} \,f_{4} = \frac{{\mathop \sum \nolimits_{i = 1}^{{N_{v} }} M_{i} }}{{N_{\text{T}} }} $$
(2.4)
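As a worked illustration of indicators f1–f4, the sketch below computes them for a given assignment matrix, assuming per-target flight times T and target values V are already available from initial path planning. The array names are illustrative assumptions, and f1 is computed as the longest single-platform flight time, matching the reconstruction of Eq. (2.1).

```python
import numpy as np

def objectives(x, T, V):
    """Compute f1..f4 for a binary assignment matrix x (Nv x NT).

    T[i, j]: flight time of platform i to serve target j (assumed given)
    V[j]:    value of target j
    """
    Nv, NT = x.shape
    per_platform_time = (x * T).sum(axis=1)   # sum of T_k over each platform's targets
    f1 = per_platform_time.max()              # makespan: longest platform time (Eq. 2.1)
    f2 = per_platform_time.sum()              # cluster's total flight time (Eq. 2.2)
    f3 = (x * V).sum()                        # total value of attacked targets (Eq. 2.3)
    f4 = x.sum() / NT                         # target coverage ratio (Eq. 2.4)
    return f1, f2, f3, f4
```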

In fact, the UAV cluster’s attack task allocation is a multi-objective integer programming problem. In multi-objective programming, each objective’s relative weight can be set according to decision-making intention, and the multi-objective program can be converted into a single-objective one:

$$ \hbox{min} \,f = \gamma_{1} \left( {\alpha_{1} f_{1} + \alpha_{2} f_{2} } \right) + \gamma_{2} \left( {\beta_{1} \left( {1 - f_{3} } \right) + \beta_{2} \left( {1 - f_{4} } \right)} \right) $$
(2.5)

Here, \( 0 \le \gamma_{1} ,\gamma_{2} \le 1, \gamma_{1} + \gamma_{2} = 1 \)

$$ \begin{aligned} & 0 \le \alpha_{1} ,\alpha_{2} \le 1, \alpha_{1} + \alpha_{2} = 1 \\ & 0 \le \beta_{1} ,\beta_{2} \le 1, \beta_{1} + \beta_{2} = 1 \\ \end{aligned} $$

f1, f2, f3 and f4 are all normalized. Different values of \( \alpha_{1} \), \( \alpha_{2} \), \( \beta_{1} \), \( \beta_{2} \), \( \gamma_{1} \) and \( \gamma_{2} \) reflect the decision-maker’s preference, and they can be adjusted automatically by the algorithm. For example, when the attacking force is adequate, in general \( \beta_{1} = 0 \), \( \beta_{2} = 1 \); when the attacking force is inadequate, \( \beta_{1} = 0.5 \), \( \beta_{2} = 0.5 \).
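Assuming f1–f4 have already been normalized to [0, 1], Eq. (2.5) reduces to a simple weighted sum. A minimal sketch, including the β switching rule described above (the remaining weight values are illustrative, not from the paper):

```python
def scalarize(f1, f2, f3, f4, adequate_force=True,
              gamma=(0.5, 0.5), alpha=(0.5, 0.5)):
    """Combine the normalized objectives per Eq. (2.5).

    gamma and alpha defaults are illustrative; the paper only fixes the
    beta switch between adequate and inadequate attacking force.
    """
    g1, g2 = gamma
    a1, a2 = alpha
    # Beta rule from the text: coverage dominates when force is adequate.
    b1, b2 = (0.0, 1.0) if adequate_force else (0.5, 0.5)
    return g1 * (a1 * f1 + a2 * f2) + g2 * (b1 * (1 - f3) + b2 * (1 - f4))

print(scalarize(0.2, 0.4, 0.9, 0.8))
```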

Constraints

(1) Number of UAV platforms

This constraint limits the number of attacking UAVs: it should not exceed the total number of UAV platforms available for task allocation.

$$ \mathop \sum \limits_{j = 1}^{{N_{\text{T}} }} x_{i,j} \le N_{v} $$
(2.6)

(2) Attack diameter

When the cluster of UAV platforms executes attack tasks, the attack diameter is the primary constraint to be taken into consideration, i.e., each UAV platform’s flight distance must be within its attack diameter.

$$ x_{{i,j_{1} }} l_{{i,j_{1} }} + \mathop \sum \limits_{k = 2}^{{M_{i} }} x_{{i,j_{k} }} l_{{j_{k - 1} ,j_{k} }} \le D_{i} $$
(2.7)

\( l_{{i,j_{1} }} \) refers to the distance between UAV platform i and its first allocated target; \( l_{{j_{k - 1} ,j_{k} }} \) refers to the distance between two successive targets allocated to the platform, \( k \in \left\{ {2, \ldots ,M_{i} } \right\} \); \( D_{i} \) refers to UAV platform i’s attack diameter.

(3) Attack height

When the cluster of UAV platforms executes attack tasks, height must also be taken into consideration, i.e., the height of each allocated target must be within the UAV platform’s attack height.

$$ x_{i,j} H_{j} \le H_{i} $$
(2.8)

\( H_{j} \) refers to the height of target j; \( H_{i} \) refers to the ascending limit of UAV platform i.

(4) Attack force

The number of targets allocated to each UAV platform shall not exceed its attack force.

$$ \mathop \sum \limits_{k = 1}^{{M_{i} }} x_{{i,j_{k} }} \le {\text{Attack}}_{i} $$
(2.9)

\( {\text{Attack}}_{i} \) is UAV platform i’s attack load, i.e., the maximal number of tasks it can execute.
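A feasibility check covering constraints (2.6)–(2.9) can be sketched as follows; the argument names, and the assumption that each platform’s route length has been precomputed by initial path planning, are ours:

```python
import numpy as np

def feasible(x, path_len, D, H_target, H_ceiling, attack_load):
    """Check constraints (2.6)-(2.9) for assignment matrix x (Nv x NT).

    path_len[i]:    total flight distance of platform i along its route
    D[i]:           attack diameter (range) of platform i
    H_target[j]:    height of target j
    H_ceiling[i]:   ascending limit of platform i
    attack_load[i]: maximal number of tasks platform i can execute
    """
    Nv, NT = x.shape
    ok_count = np.all(x.sum(axis=1) <= Nv)                    # (2.6) as printed
    ok_range = np.all(path_len <= D)                          # (2.7) attack diameter
    ok_height = np.all(x * H_target <= H_ceiling[:, None])    # (2.8) attack height
    ok_load = np.all(x.sum(axis=1) <= attack_load)            # (2.9) attack force
    return bool(ok_count and ok_range and ok_height and ok_load)
```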

3 Optimization of Operation Strategic Training Modeling

One key assumption is that the UAV cluster’s operation strategy is based on the interaction between the UAVs and the environment. Such an interaction can be regarded as an MDP (Markov decision process), so operation-strategy optimization based on reinforcement learning can adopt solutions to Markov problems. Supposing that the system is observed at times \( t = t_{1} ,t_{2} , \ldots ,t_{n} \), the decision process of an operation strategy is composed of five elements:

$$ \left\langle {S,A\left( s \right),P_{{ss^{\prime}}}^{a} ,R_{{ss^{\prime}}}^{a} ,V\left| {s,s^{\prime} \in S,a \in A\left( s \right)} \right.} \right\rangle $$
(3.1)

Each element’s meaning is given below; a code sketch of the tuple follows the list:

  1. S is the non-empty set of all possible states of the system, sometimes called the system’s state space; it can be a finite, denumerable or arbitrary non-empty set. s and s′ are elements of S denoting states.

  2. For \( s \in S \), A(s) is the set of all possible actions in state s.

  3. When the system is in state s at decision-making time t, after executing decision a, the probability that the system is in state s′ at the next decision-making point t + 1 is \( P_{{ss^{\prime}}}^{a} \). The transition probabilities of all actions constitute a cluster of transition matrices:

    $$ P_{{ss^{\prime}}}^{a} = { \Pr }\left\{ {s_{t + 1} = s^{\prime}|s_{t} = s,a_{t} = a} \right\} $$
    (3.2)
  4. When the system is in state s at decision-making time t, after executing decision a, the immediate return obtained by the system is \( R_{{ss^{\prime}}}^{a} \), usually called the “return function”:

    $$ R_{{ss^{\prime}}}^{a} = E\left\{ {r_{t + 1} |s_{t} = s, a_{t} = a,s_{t + 1} = s^{\prime}} \right\} $$
    (3.3)
  5. V is the criterion function (or objective function). Common criterion functions include: expected total return over a finite horizon, expected discounted total return and average return. It can be a state value function or a state-action value function.
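The five-element tuple of Eq. (3.1) maps naturally onto a small data structure. A minimal container sketch, with type names that are our own illustrative choices:

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

State = str    # illustrative state encoding
Action = str

@dataclass
class OperationMDP:
    """Five-element decision process <S, A(s), P, R, V> of Eq. (3.1)."""
    states: FrozenSet[State]                              # S
    actions: Callable[[State], FrozenSet[Action]]         # A(s)
    transition: Dict[Tuple[State, Action, State], float]  # P_{ss'}^a
    reward: Dict[Tuple[State, Action, State], float]      # R_{ss'}^a
    gamma: float = 0.9                                    # discount for criterion V
```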

Based on the characteristics of UAV cluster operation, the UAVs, the battlefield environment, the load platform, information resources and other supporting resources are decomposed to establish the Markov decision process (MDP) model. The MDP model’s state variables include the usability of platforms and loads, UAV positions, the priority of multiple tasks and the residual quantity of fuel. Usability refers to whether a UAV and its load are usable, their residual usability, and whether takeoff/recovery equipment is usable. Position-state variables include the current position, the position of takeoff/recovery equipment and the position on the 4D track of task execution in the sky. Priority is used to regulate the UAVs’ takeoff and landing sequence as well as the priority of multiple tasks. The residual quantity of fuel is classified into multiple levels to determine the order of UAV landings and task priority. Finally, the MDP model’s state space is obtained:

$$ S = \mathop \prod \limits_{i \in A} B_{i} \times \mathop \prod \limits_{{j \in {\text{pos}}}} {\text{PS}}_{j} \times \mathop \prod \limits_{{m \in {\text{pri}}}} {\text{PR}}_{m} \times \mathop \prod \limits_{{n \in {\text{level}}}} {\text{LE}}_{n} $$
(3.4)

Here, B refers to equipment usability and A is the set of equipment; PS is the UAV’s position state and pos is the set of position states; PR is task priority and pri is the set of task priorities; LE is the level of residual fuel and level is the set of residual-fuel levels.
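The state space of Eq. (3.4) is a Cartesian product of the usability, position, priority and fuel-level factors, so it can be enumerated directly; the factor values below are illustrative assumptions:

```python
from itertools import product

# Illustrative factor sets for Eq. (3.4).
usability = [0, 1]                                       # B_i: equipment usable or not
positions = ["parked", "takeoff", "track", "landing"]    # PS_j: position state
priorities = [1, 2, 3]                                   # PR_m: task priority
fuel_levels = ["low", "mid", "high"]                     # LE_n: residual fuel level

# State space S as the Cartesian product of the factors
# (two pieces of equipment, one position, one priority, one fuel level here).
S = list(product(usability, usability, positions, priorities, fuel_levels))
print(len(S))  # 2 * 2 * 4 * 3 * 3 = 144 states
```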

By setting up a return function and convergence conditions for learning, an optimized regulation strategy can be generated with the Q-learning approach. The sequence of state changes corresponding to the regulation strategy constitutes the UAV regulation plan.
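A minimal tabular Q-learning loop consistent with this setup is sketched below; the environment interface (reset/step/actions) is an assumption for illustration, not part of the paper’s model:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning; env is assumed to expose reset()/step()/actions()."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            # Epsilon-greedy behavior policy over the current Q estimate.
            a = random.choice(acts) if random.random() < eps \
                else max(acts, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, b)] for b in env.actions(s2))
            # Standard Q-learning temporal-difference update.
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```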

4 Research on UAV Cluster’s Operation Strategy Algorithm

The purpose of a reinforcement learning algorithm is to find a strategy \( \pi :S \to A \) that maximizes the value \( V^{\pi } \left( s \right) \) (or \( Q^{\pi } \left( {s,a} \right) \)) of every state:

$$ V^{\pi } \left( s \right) = E\{ r_{1} + \gamma r_{2} + \cdots + \gamma^{i - 1} r_{i} + \cdots |s_{0} = s\} $$
(4.1)
$$ Q^{\pi } \left( {s,a} \right) = E\{ r_{1} + \gamma r_{2} + \ldots + \gamma^{i - 1} r_{i} + \cdots |s_{0} = s,a_{0} = a\} $$
(4.2)
$$ v^{*} \left( s \right) = \mathop {\hbox{max} }\limits_{\pi } \left( {v^{\pi } \left( s \right)} \right) $$
(4.3)
$$ Q^{*} \left( {s,a} \right) = \mathop {\hbox{max} }\limits_{\pi } \left( {Q^{\pi } \left( {s,a} \right)} \right) $$
(4.4)

Here, \( r_{t} \) denotes the immediate reward at time t, and \( \gamma \left( {\gamma \in \left[ {0,1} \right]} \right) \) is the attenuation (discount) coefficient. \( v^{*} \left( s \right) \) and \( Q^{*} \left( {s,a} \right) \) are the optimal value functions, and the corresponding optimal strategy is the greedy one with respect to \( Q^{*} \), \( \pi^{*} = \arg \hbox{max} Q^{*} \left( {s,a} \right) \).
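Eq. (4.1) is the expectation of the discounted return; for a single sampled reward sequence, that return can be computed with the usual backward recursion (a minimal sketch):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return r_1 + gamma*r_2 + gamma^2*r_3 + ... per Eq. (4.1)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```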

In reinforcement learning, if the optimal value function \( Q^{*} \left( {V^{*} } \right) \) has been estimated, there are three action-selection patterns to choose from: the greedy strategy, the \( \upvarepsilon \)-greedy strategy and the softmax strategy. In the greedy strategy, the action with the highest Q value is always selected, i.e., \( \pi^{*} = \arg \hbox{max} Q^{*} \left( {s,a} \right) \). In the \( \upvarepsilon \)-greedy strategy, the action with the highest Q value is selected most of the time, and an action is selected at random from time to time so as to explore for the optimum. In the softmax strategy, actions are selected according to the weight of each action’s Q value, usually realized with a Boltzmann distribution: actions with higher Q values have higher weights and are therefore more likely to be selected.
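The three selection patterns can be sketched in a few lines; Q is assumed to be a table keyed by (state, action), and the temperature parameter tau for the Boltzmann weighting is our illustrative addition:

```python
import math
import random

def select_action(Q, s, actions, mode="greedy", eps=0.1, tau=1.0):
    """Greedy, epsilon-greedy, or softmax (Boltzmann) action selection."""
    if mode == "greedy":
        return max(actions, key=lambda a: Q[(s, a)])
    if mode == "eps-greedy":
        if random.random() < eps:
            return random.choice(list(actions))   # occasional random exploration
        return max(actions, key=lambda a: Q[(s, a)])
    # Softmax: weight each action by exp(Q / tau); higher Q, higher weight.
    weights = [math.exp(Q[(s, a)] / tau) for a in actions]
    return random.choices(list(actions), weights=weights, k=1)[0]
```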

The mechanism of all reinforcement learning algorithms is based on the interaction between the value function and the strategy: the value function can be used to improve the strategy, while evaluating the strategy enables the value function to be learned and improved. Through this interaction, the UAV cluster’s operation strategy gradually converges to the optimal value function and the optimal strategy.

5 Conclusion

At present, there are mainly two types of reinforcement learning algorithms for solving the UAV cluster’s independent intelligent operation problem: first, value function estimation, the most extensively studied and fastest-developing branch of reinforcement learning research; second, direct strategy-space search, such as genetic algorithms, genetic programming, simulated annealing and other evolutionary approaches.

Both direct strategy-space search and value function estimation can be used to solve reinforcement learning problems, and both improve the Agent’s strategy by means of training. However, direct strategy-space search does not treat the strategy as a mapping function from states to actions and does not take the value function into consideration; in other words, the environment state is not taken into consideration.

Value function estimation concentrates on the value function; that is, the environment state is regarded as a core element. Value function estimation can realize learning based on the interaction between the Agent and the environment regardless of the quality of the current strategy, good or bad. By contrast, direct strategy-space search cannot realize such stepwise, incremental learning. It is effective for reinforcement learning problems whose strategy space is small enough, or well structured enough, that an optimal strategy is easy to find. In addition, when the Agent cannot precisely perceive the environment state, direct strategy-space search displays great advantages. By modeling the UAV cluster’s independent intelligent operation, we find that value function estimation is better suited to large-scale complex problems, because it makes better use of available computing resources and achieves the goal of solving the UAV cluster’s independent intelligent operation.