1 Introduction

UAVs are one of the most rapidly expanding and active divisions of the aviation industry [1,2,3,4,5,6]. Unmanned aerial vehicles (UAVs) are useful in a variety of situations, such as search and rescue, monitoring, and exploration. As a result, UAVs need to be able to determine their trajectory quickly and accurately, especially in emergency situations or in congested environments [7,8,9,10]. They are used in non-military applications such as search and rescue/health care, disaster management, journalism, shipping, engineering geology, and so on [11,12,13,14,15,16]. The demand is enormous and will continue to grow as new technologies become available. UAVs can also be effective as Internet of Things (IoT) components when used to perform sensing activities. UAVs, however, operate in a dynamic and uncertain environment owing to their high mobility and the shadowing in air-to-ground channels. They must therefore improve the quality of their sensing and communication services without compromising comprehensive information, which makes reinforcement learning a good fit for the cellular Internet of UAVs [17].

UAV models are now being developed in quite large numbers and act as an indispensable aid to human operators across a wide range of military and civilian applications [7]. As a result, the fast-growing fleet of UAVs, along with the broadening scope of their applications, poses a severe challenge to designers. The development of high-fidelity systems was aided by technological improvements in the aviation [18,19,20,21] and ground transportation sectors [22,23,24].

Linear and nonlinear control systems have been utilized to solve a variety of control problems and obtain the desired outcomes [4, 5, 9, 10, 20]. However, a thorough understanding of the inherent limitations of these methodologies became the driving force behind developing an intelligent system capable of making optimal, sequential decisions for a complex control situation.

Intelligent techniques, grouped under the banner of machine learning (ML), have begun to show promising results in domains previously thought to be intractable. Researchers are exploring various algorithms while applying optimal control theory in new and unique ways, thanks to tremendous advancements in computational technology [25,26,27,28,29]. In RL, which is inspired by human and animal behavior, events and their effects are reinforced by the actions taken [30]. At its core, RL [31] has an agent that acquires experience through trial and error as a result of its interactions with a specific environment, thus improving its learning curve. The agent is completely unaware of the underlying system and of how it can be controlled [32]. However, it recognizes the concept of a reward signal (see Figure 1), on which the next decision is made. During the training phase, the agent learns the best actions to take based on the reward function. The trained agent then selects the actions that yield the largest rewards in order to attain optimal task performance.

Fig. 1: Basic reinforcement learning framework

As the system dynamics change or the environment transforms, the reward signal changes accordingly, and the agent alters its action policy to obtain larger rewards. Despite these strengths, which indicate that RL is a powerful tool for control problems, it carries concerns regarding the safety of its actions during the exploration phase of learning [33,34,35,36]. Control system design based on intelligent techniques is deemed most appropriate to cope with the rising complexity of system dynamics and the management of complicated controls, enhancing flexibility in a changing environment [37].
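For illustration, the agent–environment interaction cycle of Figure 1 can be written as a short Python sketch. The env and agent objects are hypothetical placeholders (not part of the system developed later), and the tuple returned by env.step is an assumption made only for this example:

    def run_episode(env, agent, gamma=0.99):
        """One training episode: the agent observes a state, acts, receives
        a scalar reward, and updates its behavior from that reward signal."""
        state = env.reset()
        total_return, discount, done = 0.0, 1.0, False
        while not done:
            action = agent.select_action(state)              # exploration vs. exploitation
            next_state, reward, done = env.step(action)      # environment transition
            agent.update(state, action, reward, next_state)  # learn from the reward
            total_return += discount * reward
            discount *= gamma
            state = next_state
        return total_return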

In recent studies, deep RL has been applied using deep deterministic policy gradient (DDPG), trust region policy optimization (TRPO), and proximal policy optimization (PPO) algorithms for conventional quadcopters only, primarily focusing on specific phases of flight, such as attitude control [38, 39], or on compensating for disturbances, with PPO outperforming the others [40]. Further similar studies are discussed in the Relevant Studies section. However, the goal of the current study differs from these works: it aims to provide an RL-based control system for an experimental UAV that has an unusual design and is under-actuated with respect to controls, making its control challenging in the continuous state and action domains.

1.1 Relevant Studies

Xiang et al. [41] presented a learning algorithm capable of self-learning. The technique is studied and developed in particular for cases where the reference trajectory is either overly aggressive or incompatible with the system dynamics. A numerical analysis is undertaken to confirm the suggested learning algorithm’s effectiveness and efficiency, as well as to demonstrate improved tracking and learning performance.

Zhang et al. [42] introduced geometric reinforcement learning (GRL) for path planning of UAVs. The authors showed that GRL makes the following contributions: (a) for path planning of many UAVs, GRL uses a special reward matrix, which is simple and efficient, and the candidate points are chosen from a region along the geometric path connecting the current and target sites; (b) the convergence of computing the reward matrix has been theoretically demonstrated, and the path can be evaluated in terms of path length and risk measure; (c) in GRL, the reward matrix is adaptively updated based on information about geometric distance and risk shared by other UAVs. Extensive testing confirmed the usefulness and feasibility of GRL for UAV navigation.

Jingzhi Hu et al. [43] integrated UAVs with the Internet of Things. They presented a distributed sense-and-send mechanism for coordinating UAV sensing and transmission. Reinforcement learning was then integrated into the cellular Internet of UAVs to handle crucial challenges such as trajectory control and resource management.

For conventional UAVs, onboard flight control systems (FCS) based upon linear control strategies with well-designed closed-loop feedback have yielded satisfactory results [9, 44,45,46,47]. Posawat designed cascaded PID controllers [44] with automatic gain scheduling and controller adaptation for various operating conditions. However, the control architecture was incapable of adapting to environmental disturbances and was highly dependent on sensor accuracy. Oualid [46] utilized two different linear control techniques for controlling UAV dynamics: a linear quadratic servo (LQ-Servo) controller based on \({\varvec{L}_2}\) and \(\varvec{L}_{\infty }\) norms was developed. The results, however, showed limited robustness to external disturbances, particularly wind gusts. Further, Doyle et al. [48] utilized \(H_{\infty }\) loop shaping in connection with \(\mu \)-synthesis, while Kulcsár [49] utilized a linear quadratic regulator (LQR) architecture for the control of a UAV. Both schemes satisfactorily manage the requisite balance between robustness and performance of the devised controller. However, both of these linear methods, besides being mathematically intricate, lose their effectiveness with increasing complexity and nonlinearity of the system.

Realizing the limitations of linear control and the evolving performance requirements of UAVs, researchers gradually resorted to nonlinear techniques to make controllers more adaptive and responsive to changing scenarios. Methodologies such as back-stepping sliding mode control (SMC), nonlinear dynamic inversion (NDI), and incremental nonlinear dynamic inversion (INDI) have emerged as strong tools for handling uncertainties and nonlinearities satisfactorily, besides having the potential to adapt to changing aircraft dynamics in an evolving environment. Escareno [50] designed a nonlinear attitude controller for a quadcopter UAV using the nested saturation technique, and the results were experimentally verified. However, the control lacked measures for maintaining performance in a harsh environment. In another work, Derafa [51] implemented a nonlinear control algorithm for a UAV incorporating the back-stepping sliding mode technique with adaptive gain. The authors successfully kept low the chattering noise caused by the sign function that is pronounced in fixed-gain controllers. Experimental results for the UAV showed acceptable performance with regard to stabilization and tracking. However, the algorithm was computationally expensive.

An understanding of the inherent limitations of linear [44,45,46] and nonlinear control techniques [52], along with the drive to achieve autonomy in controls for complex aerospace systems, prompted researchers to look for intelligent methods [53]. Under the ambit of ML, RL-based algorithms [54] have emerged as an effective technique for the design of autonomous intelligent control [55, 56]. Coupled with neural networks, RL-based algorithms have become a robust methodology for solving complex control problems, significantly outperforming contemporary linear and nonlinear control strategies. Further, with increasing computational power, state-of-the-art RL algorithms have started to exhibit promising results. Due to its highly adaptive characteristics, RL has increasingly found use in aerospace control applications for platforms such as aircraft, missile trajectory control, and fixed-wing UAVs.

Kim et al. [57], in their work on flat spin recovery for a UAV, utilized RL-based intelligent controllers. Aircraft nonlinearities near the upset region were handled in two phases, angular rate arrest (ARA) and unusual attitude recovery (UAR), using DQN (Q-learning with an artificial neural network (ANN)). Dutoi [58], in similar work, highlighted the capability of the RL framework to pick the best solution strategy based on its offline learning, which is especially useful for controlling a UAV in harsh environments and during flight-critical phases. Wickenheiser [59] exploited vehicle morphing to optimize perching maneuvers and achieve the desired objectives. In another study, Novati [60] employed deep RL for gliding and perching control of a two-dimensional elliptical body and concluded that the model-free character and robustness of deep RL suggest a promising framework for developing mechanical devices capable of exploiting complex flow environments. Krozen, in his research [61], implemented reinforcement learning as an adaptive nonlinear control scheme.

Based on our review of the related research and cited papers, we assess that the application of RL, and especially deep RL with continuous action and state domains, has been limited to tasks such as balancing inverted pendulums, legged and bipedal robots [62], and various board and computer games, often by effectively implementing a novel mix of supervised learning and deep RL [63, 64]. An RL-based control strategy with continuous state and action spaces has not been applied to the entire flight regime for developing UAV flight controls. It has been used only for handling critical flight phases [57], where linear control theory is difficult to implement, and for navigation of UAVs [65, 66]. Moreover, in-depth analysis of the reported results shows only slightly better performance, eliminating overshoots while tracking a reference heading, compared to a well-tuned PID controller; it still lacked the anticipated accuracy. Further, Rodriguez-Ramos et al. [67] successfully employed deep RL for autonomous landing on a moving platform, again focusing only on the landing phase. Considering the immense potential of RL algorithms and their limited application to complete UAV flight control systems, it is deemed necessary to explore this dimension.

1.2 Research Contributions

In this research, we explore the efficacy of an RL algorithm for an unconventional UAV. The RL-based control strategy is formulated with continuous state and control space domains that encompass the entire flight regime of the UAV, duly incorporating nonlinear dynamical path constraints. An unconventional UAV designed with the least number of control surfaces has been used to reduce the overall cost. This distinctive UAV design results in an under-actuated system, making the stability and control of the UAV particularly challenging.

A novel RL-based algorithm, named MRL, has been devised. The algorithm has been specifically modified to achieve the desired objective of range enhancement while keeping the computational time required for training the agent minimal, making it suitable for practical onboard application. The designed control framework optimizes the range of the UAV without explicit knowledge of the underlying dynamics of the physical system. The developed RL control algorithm learns offline based on a reward function evaluated after each iteration step. In line with the finalized reward function, the control algorithm autonomously ascertains the optimal sequence of the available control-surface deflections at each time step \((0.2 \ s)\) to maximize the UAV range.

The vehicle's six-degree-of-freedom (6-DoF) model is developed, capturing its translational and rotational dynamics. The results from the two developed algorithms are compared and analyzed. Simulation results show that, apart from an improved circular error probable (CEP) for reaching the designated location, the range of the UAV has also significantly increased with the proposed RL controller. Based on these promising results, it is evident that RL has immense potential in the domain of intelligent controls because of its capability for adaptive, real-time sequential decision-making in uncertain environments.

2 Problem Setup

2.1 UAV Geometric and Mass Parameters

The geometric parameters of the experimental UAV (refer to Figure 2) utilized in this research are selected to meet the mission requirements. The UAV has a mass of 596.7 kg, a wing area of 0.865 \(m^2\) with a mean aerodynamic chord of 0.2677 m, and a wing span of 1.25 m. The UAV has a wing–tail configuration with unconventional controls consisting of two all-moving inverted V tails that function as ruddervators. These control surfaces can move symmetrically to control pitch motion and differentially for coupled roll and yaw movements. An additional ventral fin is placed at the bottom to enhance lateral stability.

Fig. 2: UAV model

2.2 UAV Mathematical Modeling

In the current research, flight dynamics modeling is carried out utilizing a 6-DOF model [9], which is typically used to describe vehicle motion in 3D space. Assuming a flat, non-rotating Earth, the equations are defined as follows:

$$\begin{aligned} \dot{U} =&RV - QW -g\sin \theta +\frac{X_A}{m}\nonumber \\ \dot{V} =&-RU +PW +g\sin \phi \cos \theta +\frac{Y_A}{m}\nonumber \\ \dot{W} =&QU- PV +g\cos \phi \cos \theta +\frac{Z_A}{m} \end{aligned}$$
(1)
$$\begin{aligned} \varGamma \dot{P} =&J_{XZ}(J_X-J_Y+J_Z)PQ-[J_Z(J_Z-J_Y)\nonumber \\&+J^2_{XZ}]QR +J_Zl+J_{XZ}n\nonumber \\ J_Y\dot{Q} =&(J_Z-J_X)PR-J_{XZ}(P^2-R^2)+m\nonumber \\ \varGamma \dot{R} =&[J_{X}(J_X-J_Y)+J^2_{XZ}]PQ\nonumber \\&- J_{XZ}(J_X-J_Y+J_Z)QR + J_{XZ}l+J_Xn \end{aligned}$$
(2)
$$\begin{aligned} \dot{\phi } =&P+\tan \theta (Q\sin \phi +R\cos \phi )\nonumber \\ \dot{\theta } =&Q\cos \phi -R\sin \phi \nonumber \\ \dot{\psi } =&\frac{Q\sin \phi +R\cos \phi }{\cos \theta } \end{aligned}$$
(3)
$$\begin{aligned} \dot{P_E} =&U \cos \theta \cos \psi +V (-\cos \phi \sin \psi +\sin \phi \sin \theta \cos \psi )\nonumber \\&+W(\sin \phi \sin \psi +\cos \phi \sin \theta \cos \psi )\nonumber \\ \dot{P_N} =&U\cos \theta \sin \psi +V(\cos \phi \cos \psi +\sin \phi \sin \theta \sin \psi ) \nonumber \\&+W(-\sin \phi \cos \psi +\cos \phi \sin \theta \sin \psi )\nonumber \\ \dot{h} =&U\sin \theta -V\sin \phi \cos \theta -W\cos \phi \cos \theta \end{aligned}$$
(4)

In the above equations, it is noteworthy that the thrust terms have been removed from the force equations (1), as the UAV has no onboard thrust-generating mechanism. P, Q, R and U, V, W represent the angular and linear velocity components along the body x-, y-, and z-axes, respectively. The Euler angles \(\phi \), \(\theta \), and \(\psi \) represent the orientation of the UAV with respect to the inertial frame. The position coordinates along the inertial north and east directions are denoted \(P_N\) and \(P_E\), whereas the vehicle altitude is described by h. \(X_A,Y_A,Z_A\) are the body-axis forces, and the moments are represented by l, m, n. The moment of inertia matrix is given by J, with \(J_x\), \(J_y\), \(J_z\) the moments of inertia about the x-, y-, and z-axes, respectively, and \(J_{xy}\), \(J_{yz}\), \(J_{zx}\) the cross-products of inertia; \(\varGamma = J_xJ_z - J^2_{xz}\).

The problem was formulated as a nonlinear system defined as Eq. (5):

$$\begin{aligned} \varvec{\vec {\dot{x}}}= \varvec{f}(\varvec{\vec {x}},\varvec{\vec {u}}) \end{aligned}$$
(5)

In the above equation, \(\varvec{\vec {x}} \in \mathbb {R}^{12}\) represents the state vector, the control vector is \(\varvec{\vec {u}} \in \mathbb {R}^{2}\), and the state derivatives are represented by \(\varvec{\dot{\vec {x}}} \in \mathbb {R}^{12}\). The state vector in the body and wind axes is defined by Eq. (6) and Eq. (7), respectively.

$$\begin{aligned} \vec {\varvec{x}}= & {} [U, \, V, \, W, \, \phi , \, \theta , \, \psi , \, P, \, Q, \, R, \, h, \, P_N, \, P_E]^T, \,\,\, \vec {\varvec{x}} \in \mathbb {R}^{12}\nonumber \\ \end{aligned}$$
(6)
$$\begin{aligned} \vec {\varvec{x}}= & {} [V_T, \, \alpha , \, \beta , \, \phi , \, \theta , \, \psi , \, P, \, Q, \, R, \, h, \, P_N, \, P_E]^T, \nonumber \\&\vec {\varvec{x}} \in \mathbb {R}^{12} \end{aligned}$$
(7)

Control vector with continuous action space is defined in Eq. (8):

$$\begin{aligned} {\vec {\varvec{u}}= [LF,\,\,RF]^T, \,\,\, \vec {\varvec{u}} \in \mathbb {R}^2} \end{aligned}$$
(8)

Fresh state estimates are evaluated at each time step utilizing Eqs. (1)–(4).
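As an illustration of how these state updates can be advanced in time, the sketch below applies a standard fourth-order Runge–Kutta step to a generic state-derivative function f(x, u); f is a placeholder for the right-hand sides of Eqs. (1)–(4), and the state ordering follows Eq. (6). This is a minimal sketch, not the exact integrator used in the study:

    import numpy as np

    def rk4_step(f, x, u, dt):
        """Advance the 12-element state vector x by one time step dt under
        control u, where f(x, u) returns x_dot from Eqs. (1)-(4)."""
        k1 = f(x, u)
        k2 = f(x + 0.5 * dt * k1, u)
        k3 = f(x + 0.5 * dt * k2, u)
        k4 = f(x + dt * k3, u)
        return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

    # x holds [U, V, W, phi, theta, psi, P, Q, R, h, P_N, P_E] (Eq. (6)),
    # and u holds the ruddervator deflections [LF, RF].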

Aerodynamic forces and moments acting on the aerial vehicle during different stages of the flight are governed by Eq. (9) and Eq. (10), respectively.

$$\begin{aligned} L= & {} q_{\infty }SC_L,\,\,D= q_{\infty }SC_D,\,\,Y= q_{\infty }SC_Y \end{aligned}$$
(9)
$$\begin{aligned} l_w= & {} q_{\infty }bSC_l, \,\,m_w= q_{\infty }cSC_m, \,\,n_w= q_{\infty }bSC_n \end{aligned}$$
(10)

where L, D, Y and \(l_w \), \(m_w\), \(n_w\) represent the aerodynamic forces (lift, drag, and side force) and moments (roll, pitch, and yaw) used in the equations of motion, whereas \(C_L, C_D, C_Y\) and \(C_l, C_m, C_n\) are the dimensionless aerodynamic coefficients in the wind axes for calculating forces and moments. \(q_{\infty }\) is the dynamic pressure, S is the wing area, b is the wing span, and c is the mean aerodynamic chord.
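A minimal sketch of Eqs. (9) and (10), assuming the dimensionless coefficients have already been evaluated for the current flight condition (variable names are illustrative):

    def forces_and_moments(rho, V_T, S, b, c, CL, CD, CY, Cl, Cm, Cn):
        """Dimensionalize the aerodynamic coefficients as in Eqs. (9)-(10)."""
        q_inf = 0.5 * rho * V_T ** 2                              # dynamic pressure
        L, D, Y = q_inf * S * CL, q_inf * S * CD, q_inf * S * CY  # lift, drag, side force
        l_w = q_inf * S * b * Cl                                  # rolling moment
        m_w = q_inf * S * c * Cm                                  # pitching moment
        n_w = q_inf * S * b * Cn                                  # yawing moment
        return (L, D, Y), (l_w, m_w, n_w)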

2.3 Aerodynamic Evaluation

The aerodynamic force and moment coefficients in Eq. (9) and Eq. (10) vary with the flight conditions and control settings. A high-fidelity aerodynamic model is necessary to determine these aerodynamic coefficients accurately. The current research utilizes both non-empirical (such as CFD [68] and USAF Datcom [69]) and empirical [70] techniques to determine these coefficients. The generic high-fidelity coefficient model employed for aerodynamic parameter estimation is given in Eq. (11):

$$\begin{aligned} \begin{aligned} C_i= C_{i,static} + C_{i,dynamic} \end{aligned} \end{aligned}$$
(11)

where \(C_i\) = \(C_L\), \(C_D\), \(C_Y\), \(C_l\), \(C_m\), and \(C_n\) represent the coefficient of lift, drag, side force, rolling moment, pitching moment, and yawing moment, respectively.

The non-dimensional coefficients are usually obtained through linear interpolation of data obtained from various sources. Evaluation of the static (basic) coefficient data (see Eq. (12)) is achieved utilizing the computational fluid dynamics (CFD) technique [68, 71]; these coefficients are conventionally a function of control deflection (\(\delta _{control}\)), angle of attack (\(\alpha \)), sideslip \((\beta \)), and Mach number (M).

$$\begin{aligned} \begin{aligned} C_{i,static} (\alpha ,\beta , \delta _{control}, M)&\Rightarrow C_{D_b} (\alpha ,\beta , \delta _{control}, M),\\&C_{L_b}(\alpha ,\beta , \delta _{control}, M),\\&C_{Y_b}(\alpha ,\beta , \delta _{control}, M),\\&C_{l_b}(\alpha ,\beta , \delta _{control}, M),\\&C_{m_b}(\alpha ,\beta , \delta _{control}, M),\\&C_{n_b}(\alpha ,\beta , \delta _{control}, M) \end{aligned} \end{aligned}$$
(12)

where \(C_{D_b}\), \(C_{L_b}\), \(C_{Y_b}\), \(C_{l_b}\), \(C_{m_b}\), and \(C_{n_b}\) represent the basic components of the aerodynamic forces and moments as functions of control deflection (\(\delta _{control}\)), angle of attack (\(\alpha \)), sideslip \((\beta \)), and Mach number (M).

Similarly, the dynamic component (Eq. (13)) consists of rate and acceleration derivatives, which are evaluated utilizing empirical [70] and non-empirical (‘USAF Stability and Control DATCOM’ [69]) techniques.

$$\begin{aligned} \begin{aligned} C_{i,dynamic}(\dot{\alpha }, \dot{\beta }, p, q,r) =&Rate \,\,derivatives \\&+Acceleration \,\,derivatives \end{aligned} \end{aligned}$$
(13)

Rate derivatives are the derivatives due to roll rate (\(\textit{p}\)), pitch rate (\(\textit{q}\)), and yaw rate (\(\textit{r}\)), while acceleration derivatives are the derivatives due to changes in the aerodynamic angles \((\dot{\alpha }, \dot{\beta })\). They are shown in Eq. (14) and Eq. (15), respectively.

$$\begin{aligned}&Rate \,derivatives \nonumber \\&\quad = (C_{L_q} , C_{D_q} , C_{m_q}) \nonumber \\&\qquad +(C_{Y_p} , C_{l_p} , C_{n_p}) + (C_{Y_r} , C_{l_r} , C_{n_r}) \end{aligned}$$
(14)
$$\begin{aligned}&Acceleration \, \, derivatives \nonumber \\&\quad = (C_{L_{\dot{\alpha }}}, C_{D_{\dot{\alpha }}}, C_{m_{\dot{\alpha }}} ) \nonumber \\&\qquad +(C_{Y_{\dot{\beta }}}, C_{l_{\dot{\beta }}}, C_{n_{\dot{\beta }}} ). \end{aligned}$$
(15)
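To illustrate the coefficient build-up of Eqs. (11)–(15), the sketch below interpolates one tabulated static coefficient and adds a dynamic contribution. The grid values are placeholders, the SciPy interpolator is only one possible lookup mechanism, and the non-dimensionalization of the rates by c/(2V_T) is the conventional form assumed here rather than a detail taken from the study:

    import numpy as np
    from scipy.interpolate import RegularGridInterpolator

    # Hypothetical tabulated data for one static coefficient, e.g. C_m_b, gridded
    # over angle of attack, sideslip, control deflection, and Mach number.
    alpha_g = np.linspace(-10, 10, 21)      # deg
    beta_g  = np.linspace(-6, 6, 13)        # deg
    delta_g = np.linspace(-10, 10, 11)      # deg
    mach_g  = np.linspace(0.3, 0.9, 7)
    Cm_table = np.zeros((21, 13, 11, 7))    # placeholder CFD/DATCOM values

    Cm_static = RegularGridInterpolator((alpha_g, beta_g, delta_g, mach_g), Cm_table)

    def total_Cm(alpha, beta, delta, mach, q_rate, alpha_dot, c, V_T, Cm_q, Cm_adot):
        """Eq. (11): static (interpolated) part plus dynamic (rate/acceleration) part."""
        static = Cm_static([[alpha, beta, delta, mach]])[0]
        dynamic = Cm_q * (q_rate * c / (2 * V_T)) + Cm_adot * (alpha_dot * c / (2 * V_T))
        return static + dynamic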

3 MRL Framework

3.1 Introduction

Basic reinforcement learning algorithms aim to find an optimal state-value function \(\mathcal {V}_{\pi }^{*}\) or an action-value function \(\mathcal {Q}_{\pi }^{*}\) while following a policy \(\pi \), which is a time-dependent distribution over actions given states (Eq. (16)) and guides the choice of action at any given state.

$$\begin{aligned} \varvec{\pi (a|s)} = \varvec{\mathbb {P}[{A_t} = a|{S_t} = s]} \end{aligned}$$
(16)

The state-value function is the expected return starting from state s, while following policy \(\pi \) and gathering scalar rewards when transitioning between states (Eq. (17)). The agent’s behavior is carefully controlled during the exploration phase so that as many states as possible are visited at least once during the course of learning. The action-value function, in turn, is determined by the return accumulated by the agent when it is in a particular state s and takes action a (Eq. (18)).

$$\begin{aligned} \varvec{v_{\pi }(s)}= & {} \varvec{\mathbb {E}_{\pi }[{G_t}|{S_t} = s]} \end{aligned}$$
(17)
$$\begin{aligned} \varvec{q_{\pi }(s,a)}= & {} \varvec{\mathbb {E}_{\pi }[{G_t}|{S_t} = s,{A_t} = a]} \end{aligned}$$
(18)

The expected reward \(\mathcal {R}_{s}^{a}\) is defined as the expectation of the reward received at each step given the state and action, as shown in Eq. (19):

$$\begin{aligned} \varvec{\mathcal {R}_s^a} = \varvec{\mathbb {E}_{\pi }[{R_{t+1}}|{S_t} = s,{A_t} = a]} \end{aligned}$$
(19)
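For concreteness, the return \(G_t\) appearing in Eqs. (17)–(19) is the discounted sum of the rewards collected along an episode. A minimal Python sketch (the discount factor value is an arbitrary example):

    def discounted_return(rewards, gamma=0.99):
        """G_t = R_{t+1} + gamma*R_{t+2} + ... for the reward sequence of one episode."""
        G = 0.0
        for r in reversed(rewards):
            G = r + gamma * G
        return G

    # Averaging discounted_return over many episodes started from state s while
    # following policy pi gives a Monte Carlo estimate of v_pi(s).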

3.2 RL Algorithm Selection Challenge

The development of an appropriate RL algorithm for a given problem is challenging, as the implementation varies with the nature of the problem at hand [72, 73]. Factors such as the type of state (s) and action (a) space (discrete or continuous), direct policy search \((\pi )\) versus value function (v), model-free versus model-based learning, and the need to incorporate neural networks (deep RL) dictate the formulation or selection of an appropriate algorithm.

The problem addressed in the current research is a complex nonlinear problem with mixed, coupled controls. It has a \(\varvec{12}\)-dimensional state space and a \(\varvec{2}\)-dimensional action space, both of which are continuous. Realizing the complexity of the problem due to the continuous state and action spaces [74], a unique MRL approach is employed that adapts optimally to the desired requirements.

3.3 Model-Free Reinforcement Learning (MRL) and RL Dynamic Programming (DP) Architecture

3.3.1 RL Dynamic Programming (DP)

The RL DP algorithm employs Bellman’s principle of optimality [75] at its core. The optimality principle works by breaking a large, complex problem into smaller subproblems and then solving each recursively, i.e., it optimizes the subproblems and combines them to form an optimal solution [76,77,78,79]. The RL DP algorithm requires the environment to be a Markov decision process (MDP) whose model, including the state transition matrix, is known. It performs full-width backups at each step (refer to Figure 3), where every possible successor state and action is considered at least once. It computes the value of a state based on all possible actions a, all resulting successor states \(s'\), and all possible rewards. The RL DP algorithm evaluates the state-value and action-value functions using Eq. (20).

Fig. 3: RL DP implementation

$$\begin{aligned} \begin{aligned} \varvec{v_{\pi }(s)} =&\varvec{\sum _{a \in A}\pi (a|s)}\left( \varvec{\mathcal {R}_s^a} + \varvec{\gamma \sum _{s'\in \mathcal {S}}\mathcal {P}_{ss'}^av_{\pi }(s')}\right) , \\ \varvec{q_{\pi }(s,a)} =&\varvec{\mathcal {R}_s^a} + \varvec{\gamma \sum _{s'\in \mathcal {S}}\mathcal {P}_{ss'}^a\sum _{a' \in A}}\varvec{\pi (a'|s')q_{\pi }(s',a')} \end{aligned} \end{aligned}$$
(20)

The policy (set of good actions) that gives the maximum reward as per the defined reward function is known as the optimal policy \(\varvec{\pi _*}\) and is defined in Eq. (21):

$$\begin{aligned} \begin{aligned} \varvec{v_{*}(s)} =&\varvec{\max _{\pi }v_{\pi }(s)}, \\ \varvec{q_{*}(s,a)} =&\varvec{\max _{\pi }q_{\pi }(s,a)} \end{aligned} \end{aligned}$$
(21)

RL DP, once configured optimally, is ideally suited to situations where the state of the system changes continuously over time and sequential decisions are required [80]. It sequentially improves the policy because the action selected at each step maximizes the overall return.
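As an illustration of these full-width backups, the sketch below runs value iteration on a small finite MDP. The transition tensor P and expected-reward matrix R are assumed to be known, which is precisely the model knowledge that the MRL variant of the next section dispenses with; array shapes and names are illustrative only:

    import numpy as np

    def value_iteration(P, R, gamma=0.95, tol=1e-6):
        """P[a, s, s2] = transition probability, R[s, a] = expected reward.
        Returns optimal state values and a greedy (optimal) policy, per Eq. (21)."""
        n_actions, n_states, _ = P.shape
        V = np.zeros(n_states)
        while True:
            # Full-width backup: every action and every successor state is considered.
            Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a]
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V, Q.argmax(axis=1)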

3.3.2 MRL Framework

The new MRL algorithm devised in this research is a derivative of the RL DP algorithm. However, in MRL, a priori knowledge of the model parameters is not required, which effectively makes the proposed algorithm model-free. Further, the process of policy optimization is managed through the iterative development of an optimal reward function instead of a value function or action-value function only. This ensures that, from the outset, the optimal action is chosen at each time step [81, 82].

After the development of an optimal reward function, the proposed MRL algorithm, an improved and model-free variant of RL DP, considers all available actions one by one, calculating the reward for each action taken at every step of the algorithm. Then, among all the rewards accumulated for the actions taken in a particular state, it selects the action with the maximum reward as the optimal action, as shown in Eq. (22). The MRL algorithm is elaborated in Algorithm 1.

$$\begin{aligned} \varvec{V(S_t)} \xleftarrow {\text {max}} [\varvec{R_{t+1}} + \varvec{\gamma } \varvec{V(S_{t+1})}] \end{aligned}$$
(22)
Algorithm 1: MRL algorithm

This process of identifying optimal action at each step continues to ensure optimizing the entire trajectory starting from the initial launch conditions to the terminal stage.

Configuring the trajectory optimization problem in the MRL environment was challenging, as it was difficult to formulate a reward function that fulfills the desired objectives optimally. Erroneously developed reward functions drive the agent toward non-priority goals and non-converging solutions. Another problem was that the optimization process is inherently iterative: arriving at the desired final reward function takes considerable time, which must be minimized. Lastly, applying MRL to a complex problem in the continuous domain requires accurate discretization of the constituent domains, and the resulting discrete sets need to be curtailed to ensure that the algorithm remains computationally viable.

3.3.3 MRL Controller Development Architecture

To make the above-stated MRL controller algorithm efficient, the associated action space was analyzed. With two actions, \(u \in \varvec{R^2}\) (i.e., LF and RF), the search space was bounded by the deflection range of \(\pm {10}^{\circ }\). The action space of each control was then discretized into \(\varvec{50}\) equal intervals, giving a total of \(\varvec{2500}\) action pairs. This was primarily done to keep the algorithm computationally tractable. A scalar reward function was then formulated to maximize the glide range of the experimental glide UAV. A penalty term was built into the reward function, ensuring that if the platform strays from the desired state values during the learning phase, the reward decreases as the penalty is deducted.

Starting from the initial (launch) conditions, the entire discretized action space is swept. A scalar reward for each action pair is calculated based on the finalized reward function. The action pair that yields the highest reward for the given set of states is chosen as the optimal state–action pair.

Subsequently, at the next step, for the newly reached set of states, the same sweep over the \(\varvec{2500}\) actions is applied, and again the optimal action pair with the highest reward is selected and stored along with the new set of states. This per-step optimization continues until the terminal state is reached (when the experimental glide vehicle hits the ground, i.e., the condition \(z \le 0\) employed in the algorithm). It is noteworthy that the optimal action corresponding to the maximum reward is taken at every step, so the entire trajectory is optimized. The results are discussed in Sect. 4.
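A minimal sketch of this per-step sweep is given below. The functions step_dynamics and reward_fn are placeholders for the 6-DOF propagation of Eqs. (1)–(4) and the reward of Eq. (28), respectively, the state indexing follows Eq. (6), and the greedy selection corresponds to the backup of Eq. (22):

    import numpy as np
    from itertools import product

    # 50 deflection levels per ruddervator over +/-10 deg -> 2500 candidate action pairs
    deflections = np.linspace(-10.0, 10.0, 50)
    action_pairs = list(product(deflections, deflections))

    def mrl_rollout(x0, step_dynamics, reward_fn, dt):
        """At every time step, evaluate all 2500 discretized (LF, RF) pairs and
        keep the pair with the highest scalar reward (greedy one-step MRL sweep)."""
        x, trajectory = np.asarray(x0, dtype=float), []
        while x[9] > 0.0:                        # terminal condition: altitude h <= 0
            best_reward, best_action, best_next = -np.inf, None, None
            for lf, rf in action_pairs:
                x_next = step_dynamics(x, (lf, rf), dt)
                r = reward_fn(x, x_next)
                if r > best_reward:
                    best_reward, best_action, best_next = r, (lf, rf), x_next
            trajectory.append((x, best_action, best_reward))
            x = best_next
        return trajectory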

4 Results and Discussion

The results obtained from the suggested MRL algorithm, a variant of the RL DP algorithm, are discussed here. The variation of all 12 states, as listed in Eq. (7), during the glide phase of the experimental UAV has been plotted against the episodic steps. After test and trial, the simulation time step is kept at 0.1 s, as it adequately captures the rate of change of the states and yields optimal results over the entire state space. The initial launch conditions for the gliding vehicle are specified in Table 1.

Table 1 Initial launch conditions

4.1 MRL Controller Results

The initial reward function formulated for the MRL controller is given in Eq. (23).

$$\begin{aligned} \begin{aligned} \varvec{pny}&= \varvec{|P|} + \varvec{|Q|} + \varvec{|R|} \\ \varvec{r}&= \varvec{xpos} \\ \varvec{Rew}&= \varvec{r} - \varvec{pny} \\ \end{aligned} \end{aligned}$$
(23)

where \(\varvec{pny}\) represents the penalty defined at each step of the simulation and \(\varvec{r}\) is a scalar value based on the increasing \(\varvec{xpos}\), the incremental current \(\varvec{x}\) value or gliding distance covered. At first, only the three states corresponding to the body rates were included in the cost function. Simulation carried out with this initially formulated reward function showed the body rates exploding just after \(\varvec{300}\) episodic steps (refer to Figures 4, 5, 6), while only approximately \(\varvec{19 \, km}\) of range was achieved, as shown in Figure 7. It is noteworthy that the roll and yaw rates are excessively high, showing the platform's instability in the roll and yaw dynamics and their inherent strong coupling due to the unconventional design of the UAV.

Fig. 4: Roll rate variation, reward function I

Fig. 5: Pitch rate variation, reward function I

Fig. 6: Yaw rate variation, reward function I

Fig. 7: Glide range of UAV, reward function I

Analysis of these results necessitated more stringent control of the body rates. Therefore, the next iteration focused on adding variable weightages to the rates in order to control them efficiently (Eq. (24)). The addition of the weights primarily aimed at keeping the penalty low. This focused effort also increased the reward, as is evident in the increased glide range. However, once again, the rates grew beyond the anticipated tolerance range, causing instability.

$$\begin{aligned} \begin{aligned} \varvec{pny}&= \varvec{n_1|P|} + \varvec{n_2|Q|} + \varvec{n_3|R|} \\ \varvec{r}&= \varvec{xpos} \\ \varvec{Rew}&= \varvec{r} - \varvec{pny} \\ \end{aligned} \end{aligned}$$
(24)

After further deliberation, the change in the rates, in addition to the rates themselves, was targeted, and a new reward function was formulated as given in Eq. (25).

$$\begin{aligned} \varvec{pny} = \varvec{n_1|P|} + \varvec{n_2|Q|} + \varvec{n_3|R|} + \varvec{\varDelta P} + \varvec{\varDelta Q} + \varvec{\varDelta R} \end{aligned}$$
(25)

where \(\varDelta \) in the reward function represents the state change.

Analysis of the initial results of this new structure reveals that the rates remained controlled for more time steps and the range was slightly enhanced to \(\varvec{22 \, km}\), as evident in Figure 11; however, the rates still blew up partway through (refer to Figures 8, 9, 10).

Fig. 8: Rates variation, reward function III

Fig. 9: Rates variation, reward function III

Fig. 10: Rates variation, reward function III

Fig. 11: Glide range of UAV, reward function III

Next, the previous weightages were re-tuned, and variable weightages were added to the rate-change terms of the reward function. It is worth highlighting that, because of the excessive nonlinearity of the experimental vehicle arising from its peculiar design, the roll and yaw rates were given special focus, as shown in Eq. (26).

$$\begin{aligned} \varvec{pny}= & {} \varvec{n_1|P|} + \varvec{n_2|Q|} + \varvec{n_3|R|} + \varvec{n_4\varDelta P} \nonumber \\&+\varvec{n_5\varDelta Q} + \varvec{n_6\varDelta R} \end{aligned}$$
(26)

Although controllability of the rates was achieved for a longer duration (refer to Figures 12, 13, 14), the vehicle remained unstable (Figure 15), with the range enhanced to about 30 km (Figure 16).

The increasing range, precision, and rate controllability over a larger number of steps built confidence in the iterative re-tuning of the reward function.

Fig. 12: Roll rate variation, reward function IV

Fig. 13: Pitch rate variation, reward function IV

Fig. 14: Yaw rate variation, reward function IV

Fig. 15: Variation of reward, reward function IV

Fig. 16: Glide range of UAV, reward function IV

It is critical to understand that a random increase in the weights would increase pny, sharply decreasing the reward at each step. Therefore, a thorough analysis is required when formulating the reward function, as an ill-developed reward function would result in instability and non-convergence of the MRL algorithm. With this concern in focus, the difference between the rates and their corresponding desired absolute values was next included in the reward function, as elaborated in Eq. (27).

$$\begin{aligned} \begin{aligned} \varvec{pny} =&\varvec{n_1|P|} + \varvec{n_2|Q|} + \varvec{n_3|R|} + \varvec{n_4\varDelta P} + \varvec{n_5\varDelta Q} + \\&\varvec{n_6\varDelta R} + \varvec{n_7\delta P} + \varvec{n_8\delta Q} + \varvec{n_9\delta R} \end{aligned} \end{aligned}$$
(27)

where \(\varvec{\delta }\) represents the difference from the desired reference value in the penalty part of the reward function. Interim results based on the reward function of Eq. (27) show improvement in controlling the rates, as shown in Figures 17, 18, 19. The reward started to increase with each step of the episode, as shown in Figure 20. Similarly, the range, lateral distance, and altitude showed considerably improved results, as depicted in Figure 21 and Figure 22. A gliding range of around \(\textit{\textbf{63 km}}\) was achieved.

Fig. 17: Roll rate variation, reward function V

Fig. 18: Pitch rate variation, reward function V

Fig. 19: Yaw rate variation, reward function V

Fig. 20: Reward function V

Fig. 21: Glide path of UAV, reward function V

Fig. 22: Altitude profile of UAV, reward function V

The iterative process of formulating an optimal reward function continued, focusing clearly on arresting the rate variations. To improve the control of the states, additional dynamic weights \(n_7\), \(n_8\), \(n_9\), and \(n_{10}\) were also added to the structure of Eq. (27) to gain effective control of the changing rates at each step of the episode. Subsequently, the \(y_{dis}\) parameter was also added to the penalty to restrict the platform's lateral movement in the Y-direction. Additionally, the altitude decrease, i.e., zpos, was included in r so that it contributes positively at every step. The final reward function is given as the set of Eq. (28).

$$\begin{aligned} \begin{aligned} \varvec{pny} =&\varvec{n_1|P|} + \varvec{n_2|Q|} + \varvec{n_3|R|} + \varvec{n_4\varDelta P} + \varvec{n_5\varDelta Q} \\&+ \varvec{n_6\varDelta R} + \varvec{n_7\delta P} + \varvec{n_8\delta Q} + \varvec{n_9\delta R} + \varvec{n_{10}\,y_{dis}} \\ \varvec{r} =&\varvec{10^{-3}} \times \varvec{xpos^2} + (\varvec{36000} - \varvec{zpos}) \\ \varvec{rew} =&\varvec{r} - \varvec{pny}\\ \end{aligned} \end{aligned}$$
(28)
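A minimal sketch of this final reward computation is given below. The weight values n1–n10 are tuning parameters that the study arrived at iteratively and are not reproduced here; the use of absolute values for the Δ and δ terms and the dictionary-style state access are assumptions made only for this illustration:

    import numpy as np

    def final_reward(state, prev_rates, desired_rates, n):
        """Sketch of Eq. (28): reward r minus penalty pny. n holds weights n1..n10."""
        P, Q, R = state['P'], state['Q'], state['R']
        dP, dQ, dR = np.abs(np.array([P, Q, R]) - np.asarray(prev_rates))     # Delta terms
        eP, eQ, eR = np.abs(np.array([P, Q, R]) - np.asarray(desired_rates))  # delta terms
        pny = (n[0] * abs(P) + n[1] * abs(Q) + n[2] * abs(R)
               + n[3] * dP + n[4] * dQ + n[5] * dR
               + n[6] * eP + n[7] * eQ + n[8] * eR
               + n[9] * abs(state['ydis']))
        r = 1e-3 * state['xpos'] ** 2 + (36000.0 - state['zpos'])
        return r - pny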

After incorporating the final reward function, Eq. (28), into the control algorithm, the final results for all states of the MRL-based controller, plotted against sequential episodic time steps for the glide vehicle, are presented in the ensuing paragraphs. The controller's selection of optimal control deflections throughout the flight regime, amidst changing scenarios, can be appreciated from the state histories and the gliding range achieved.

Fig. 23: UAV rates for reward function V

The variation of the rates during the flight of the UAV is depicted in Figure 23. The initial negative spike in the roll and yaw rates highlights the exploration phase of the agent, where it learns to select the best control deflections to arrest the increasing roll and yaw rates. The graph also confirms the strong roll–yaw coupling arising from the unconventional design of the UAV. After 500 episodic steps, an optimal trade-off among the rates achieves the maximum glide range.

Figures 24, 25, and 26 show the variation of the Euler angles during the flight. Considerable variation in the roll angle (around \(\pm 3^{\circ }\)) is initially experienced until the rates settle. Later, it settles to about \(\pm 0.8^{\circ }\), which indicates a negligible loss of energy. The pitch angle variation in an episode is initially large (around 0 to \(-4^{\circ }\)) until the rates are contained; later it stays close to \(-2^{\circ }\) but shows a slight diverging behavior at the culmination of the episode, which is not desirable but acceptable. The variation of the yaw angle in an episode is initially considerable (around \(\pm 2^{\circ }\)) until the optimal rates trade-off is achieved; later it stays close to \(+1^{\circ }\) because the UAV is covering lateral distance eastwards. The initial variation (up to 500 episodic steps) in the roll and pitch rates can also be connected with the roll angle variation.

Fig. 24: UAV roll angle variation

Fig. 25: UAV pitch angle variation

Fig. 26: UAV yaw angle variation

Figure 27 shows the glide path of the UAV. The platform achieved an optimal range of more than 120 km. While maintaining a smooth descent, the UAV holds a nearly constant yaw angle of around 1 deg, and the total lateral distance covered over the entire gliding flight is approximately 2.4 km.

Fig. 27: Optimal glide path of UAV

Figure 28 depicts the variation of the aerodynamic angles during the flight. The angle of attack, starting from an initial value of \(2^{\circ }\), is maintained around \(2.6^{\circ }\) once the initial fluctuation of the body rates is controlled. The sideslip angle is adjusted during the flight to achieve maximum range.

Fig. 28: Aerodynamic angles of UAV

The velocity decreases smoothly as a result of drag and the slight increase in angle of attack, as shown in Figure 29.

Fig. 29: Velocity profile of UAV

The altitude variation is smooth along the trajectory, and the vehicle's descent is controlled optimally to maximize the range, as shown in Figure 30.

Fig. 30: Altitude variation of UAV

It is evident from the results that the autonomous MRL controller continuously arrests the rates through the reward function, keeping them within limits in pursuit of optimal performance. The reward graph grows gradually as the reward increases, indicating that optimal actions are being taken at every step of the episode.

5 Conclusion

In this research, an RL-based intelligent nonlinear controller for an experimental glide UAV was proposed utilizing the MRL algorithm. The implemented control algorithm showed promising results in achieving the primary objective of maximizing the range while keeping the platform stable within its design constraints throughout the flight regime. The MRL approach gave an optimal range of around \(\varvec{120 \ km}\) while handling the nonlinearity of the vehicle (controlling the roll, pitch, and yaw rates in a trade-off) through effective control deflections, which were monitored by the evolving reward function. The devised RL algorithm proved to be computationally acceptable, and the agent was successfully trained over a large state and action space.

The performance of the controller was evaluated in a 6-DoF simulation developed with the help of MATLAB and FlightGear software. The RL-based controller outperformed the classical controller by being effective over the entire flight regime of the vehicle, thus dispensing with the conventional approach of calculating various equilibria along the trajectory and then trying to keep the vehicle stable within the ambit of these equilibria using linear or nonlinear methods. The investigations made in this research provide a mathematically grounded analysis for designing a preliminary guidance and control system for aerial vehicles using intelligent controls. This research should open avenues for researchers designing intelligent control systems for aircraft, UAVs, and autonomous missile trajectory control, for both powered and un-powered configurations.