
1 Introduction

The field of autonomous robotics is growing rapidly because of the range of tasks that robots can perform [7]. Examining the missions carried out by such robots reveals that many of them involve considerable risk: in numerous cases, the safety of the human workers sent to the deployment site is endangered. By removing the operator from the hazardous location and sending a robot to handle the situation instead, this risk is avoided, and robots are therefore being produced at scale. Alongside ground robots, flying robots also keep the scientific community busy, and researchers are particularly concerned with the autonomy of such systems [6]. This problem is especially interesting now that groups of unmanned aerial vehicles (UAVs) take off routinely: coordinating multiple systems towards a shared goal could provide a solution in a variety of scenarios, such as forest patrols or the identification of suspicious behaviour in stalking incidents.

When studying architectures of autonomous computing systems for unmanned aerial vehicles, the question of available information arises: what information can the system use to solve the autonomy problem? The standard sensors of an unmanned aerial vehicle are: i) camera(s), ii) radar(s), iii) a time-of-flight sensor, and iv) a LiDAR sensor [8].

Apart from the camera, these sensors contribute to obstacle detection in unknown areas, while the camera of the UAV gathers all of the required visual information. Using image-processing techniques combined with modern artificial-intelligence approaches, current systems can perform remarkably well at such tasks under the right conditions [2]. Typically, the only requirement is sufficient light for the camera to capture the required information in adequate detail. A lower resolution can imply higher uncertainty in the solution the system software computes at its output, and hence an increase in risk.

2 Preliminaries

A reinforcement learning (RL) problem may be characterised as a Markov decision process consisting of four sets [1], \(\langle S, A, R, T\rangle \). S is a finite set of situations (states) experienced by the agent during training; the information provided by the environment at the input of each training cycle configures the agent's current state. A is a finite set of actions from which the agent may choose in a given situation. R(s, a) computes the reward given in a state s after an action a. \(T(s, a, \hat{s}) \rightarrow [0,1]\) is the state transition function: it specifies the probability of transitioning from state s to state \(\hat{s}\) after action a. The functions R(s, a) and \(T(s, a, \hat{s})\) depend only on the current values of s, a, and \(\hat{s}\). The probability distribution of a Markov decision process is defined as follows:

$$\begin{aligned} {\text {Pr}}\left\{ s_{t+1}=\hat{s}, r_{t+1}=r \mid s_{t}, a_{t}, r_{t}, s_{t-1}, a_{t-1}, \ldots , r_{1}, s_{0}, a_{0}\right\} \end{aligned}$$
(1)

Equation (1) describes the general case in which the next state and reward may depend on the entire history of states, actions, and rewards, whereas Eq. (2) gives the simpler form in which they depend only on the current state and action.

$$\begin{aligned} {\text {Pr}}\left\{ s_{t+1}=\hat{s}, r_{t+1}=r \mid s_{t}, a_{t}\right\} \end{aligned}$$
(2)

When Eq. (1) is equal to Eq. (2) for all \(\hat{s}, r\) and all histories of the form \(s_{t}, a_{t}, r_{t}, \ldots , s_{1}, a_{1}, r_{1}, s_{0}, a_{0}, r_{0}\), the process satisfies the Markov property.

The Markov property is of substantial importance when studying RL algorithms. In most real problems, however, the agent does not have access to all of the environment's data; some state information remains hidden, so the Markov property no longer holds from the agent's point of view. Such problems are described as partially observable Markov decision processes.
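
To make the \(\langle S, A, R, T\rangle \) formulation concrete, the following minimal Python sketch encodes a toy MDP with tabular states and actions; the state names, rewards, and transition probabilities are invented purely for illustration and are not part of the paper's setup.

```python
import random

# Toy MDP: states, actions, reward function R(s, a) and transition function T(s, a, s').
# All names and numbers below are illustrative, not taken from the paper.
states = ["far", "near", "goal"]
actions = ["advance", "hover"]

# R(s, a): immediate reward for taking action a in state s.
R = {("far", "advance"): -1.0, ("far", "hover"): -2.0,
     ("near", "advance"): 10.0, ("near", "hover"): -2.0,
     ("goal", "advance"): 0.0, ("goal", "hover"): 0.0}

# T(s, a, s'): probability of moving from s to s' after action a (each row sums to 1).
T = {("far", "advance"):  {"far": 0.2, "near": 0.8, "goal": 0.0},
     ("far", "hover"):    {"far": 1.0, "near": 0.0, "goal": 0.0},
     ("near", "advance"): {"far": 0.0, "near": 0.1, "goal": 0.9},
     ("near", "hover"):   {"far": 0.3, "near": 0.7, "goal": 0.0},
     ("goal", "advance"): {"goal": 1.0},
     ("goal", "hover"):   {"goal": 1.0}}

def step(s, a):
    """Sample the next state from T(s, a, .) and return it with the reward R(s, a)."""
    nxt = T[(s, a)]
    s_next = random.choices(list(nxt.keys()), weights=list(nxt.values()))[0]
    return s_next, R[(s, a)]

s_next, r = step("far", "advance")
```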

2.1 Value Function

Each training cycle consists of the agent observing a state and executing one action. The mapping from states to actions is known as the policy \(\pi \): specifically, \(\pi (s, a) \rightarrow [0,1]\) gives the probability of selecting an action a in a given state s under policy \(\pi \). The state-value function \(V^{\pi }\) under this policy is defined by Eq. (3).

$$\begin{aligned} V^{\pi }(s)=E_{\pi }\left\{ R_{t} \mid s_{t}=s\right\} =E_{\pi }\left\{ \sum _{k=0}^{\infty } \gamma ^{k} r_{t+k+1} \mid s_{t}=s\right\} \end{aligned}$$
(3)

In Eq. (3), \(E_{\pi }\) denotes the expected value received by the agent if it follows policy \(\pi \) from state \(s_{t}\). Equation (3) can be rewritten, as shown in Eq. (4), as the discounted sum of the rewards collected over the training cycles performed by the agent.

$$\begin{aligned} V^{\pi }\left( s_{t}\right) \equiv r_{t}+\gamma r_{t+1}+\gamma ^{2} r_{t+2}+\cdots \Longrightarrow V^{\pi }\left( s_{t}\right) \equiv \sum _{k=0}^{\infty } \gamma ^{k} r_{t+k} \end{aligned}$$
(4)

In Eq. (4), the factor \(\gamma \in [0,1]\) is a discount factor applied to this sum so that rewards received sooner carry more weight. An equally important concept is the function \(Q^{\pi }\), known as the action-value (payoff) function for policy \(\pi \). \(Q^{\pi }\) is the expected payoff of choosing an action a in a state s and thereafter following policy \(\pi \). The function \(Q^{\pi }(s, a)\) is given by Eq. (5).

$$\begin{aligned} Q^{\pi }(s, a)=E_{\pi }\left\{ R_{t} \mid s_{t}=s, a_{t}=a\right\} =E_{\pi }\left\{ \sum _{k=0}^{\infty } \gamma ^{k} r_{t+k+1} \mid s_{t}=s, a_{t}=a\right\} \end{aligned}$$
(5)

The goal when training an RL agent is to approximate the policy that maximises the sum of rewards over time. The optimal policy is denoted by \(\pi ^{*}\); the value function which returns the maximum accumulated reward to the agent in a state s while following \(\pi ^{*}\) is denoted by \(V^{*}(s)\) and is called the optimal value function. \(Q^{*}(s, a)\) is the optimal action-value function, i.e., the expected reward of choosing action a in state s and thereafter following the optimal policy.
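
The definitions in Eqs. (3)–(5) can be illustrated with a short sketch that estimates \(V^{\pi }(s)\) by averaging sampled discounted returns (a Monte Carlo estimate); the environment rollout and policy stubs below are placeholder assumptions, not the paper's setup.

```python
GAMMA = 0.99  # discount factor gamma in Eqs. (3)-(5)

def discounted_return(rewards, gamma=GAMMA):
    """Compute sum_k gamma^k * r_{t+k} for one episode (Eq. (4))."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def estimate_value(env_rollout, policy, state, episodes=1000):
    """Monte Carlo estimate of V^pi(s): the average discounted return over sampled episodes.
    `env_rollout(state, policy)` is assumed to return the list of rewards observed when
    starting from `state` and following `policy` until termination."""
    returns = [discounted_return(env_rollout(state, policy)) for _ in range(episodes)]
    return sum(returns) / len(returns)

# Tiny usage example with a dummy rollout that always yields three rewards.
dummy_rollout = lambda s, pi: [0.0, 0.0, 1.0]
print(estimate_value(dummy_rollout, policy=None, state="s0"))  # approximately gamma^2
```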

2.2 Multilayer Perceptron Neural Networks

Multilayer Perceptron neural networks are composed of several layers of Perceptron-type neurons. The neurons compute their outputs through forward propagation and are trained using the backpropagation algorithm. The input vector is received by the first layer of Perceptron neurons; subsequent layers transform it using two elements: the numerical weights of the synapses between neurons of consecutive layers and the non-linear activation function applied to the output of each neuron.

During training, a multilayer Perceptron neural network employs a collection of numerical samples described by a set of characteristic features for each class in the set of classes. The produced sample set is arranged as the matrix X shown in Eq. (6).

$$\begin{aligned} X=\left[ \begin{array}{cccc} \mid &{} \mid &{} \cdots &{} \mid \\ x^{(1)} &{} x^{(2)} &{} \cdots &{} x^{(m)} \\ \mid &{} \mid &{} \cdots &{} \mid \end{array}\right] \end{aligned}$$
(6)

In Eq. (6), each vector \(x^{(i)}\) has dimension n, where n is the number of attributes used to encode the classes. Therefore, \(X \in \mathbb {R}^{n \times m}\), where m is the number of training samples. Alongside the matrix X there is also the matrix Y, shown in Eq. (7), which stores the desired output of the model after the transformation of the input data. Each element \(y^{(i)}\) of Y is a number representing the class to which sample i belongs. Therefore, \(Y \in \mathbb {R}^{1 \times m}\).

$$\begin{aligned} Y=\left[ \begin{array}{llll} y^{(1)}&y^{(2)}&\ldots&y^{(m)} \end{array}\right] \end{aligned}$$
(7)

The forward propagation of the matrix X through the model produces intermediate transformations of X. The matrix A at each hidden layer j of the neural network is defined as in Eq. (8).

$$\begin{aligned} A^{[j]}=\left[ \begin{array}{cccc} \mid &{} \mid &{} \ldots &{} \mid \\ a^{(1)[j]} &{} a^{(2)[j]} &{} \ldots &{} a^{(m)[j]} \\ \mid &{} \mid &{} \cdots &{} \mid \end{array}\right] \end{aligned}$$
(8)

The synapse weights of a multi-layer Perceptron neural network for one layer j are described in Eq. (9).

$$\begin{aligned} W^{[j]}=\left[ \begin{array}{cccc} w_{1,1}^{[j]} &{} w_{1,2}^{[j]} &{} \cdots &{} w_{1, m}^{[j]} \\ w_{2,1}^{[j]} &{} w_{2,2}^{[j]} &{} \ldots &{} w_{2, m}^{[j]} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ w_{n, 1}^{[j]} &{} w_{n, 2}^{[j]} &{} \cdots &{} w_{n, m}^{[j]} \end{array}\right] \end{aligned}$$
(9)

In Eq. (9), the subscript n refers to the number of neurons in layer j-1, while the subscript m refers to the number of neurons in layer j. Each layer j also includes a special term called the bias (threshold). The bias has no synapses with the previous layer; it stores a constant that is propagated only within its own layer. Denoting by \(b^{[j]}\) the bias of layer j, Eq. (10) describes the result of the forward-propagation algorithm from layer j-1 to layer j.

$$\begin{aligned} Z^{[j]}=\left( W^{[j]}\right) ^{T} \times A^{[j-1]} \end{aligned}$$
(10)

The bias \(b^{[j]}\) is added to each element of the matrix \(Z^{[j]}\). A non-linear function, called the activation function, is then applied to the result. With the activation function, the model can categorise the samples of the training set; without one, the model performs only a linear transformation of the input data and therefore cannot learn complex patterns. If \(g(\cdot )\) is the activation function, then at the end of one forward-propagation cycle of the neural network, the result is that shown in Eq. (11).

$$\begin{aligned} A^{[j]}=g\left( Z^{[j]}+b^{[j]}\right) \end{aligned}$$
(11)
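
A minimal NumPy sketch of the forward propagation described by Eqs. (10) and (11); the layer sizes and the choice of ReLU as \(g(\cdot )\) are assumptions made only for illustration.

```python
import numpy as np

def relu(z):
    """A common choice for the non-linear activation function g(.)."""
    return np.maximum(0.0, z)

def forward_layer(W, b, A_prev, g=relu):
    """One forward-propagation step from layer j-1 to layer j:
    Z[j] = W[j]^T @ A[j-1]   (Eq. (10)),   A[j] = g(Z[j] + b[j])   (Eq. (11))."""
    Z = W.T @ A_prev
    return g(Z + b)

# Illustrative shapes: n = 4 inputs, 3 neurons in the hidden layer, m = 5 samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))    # X in R^{n x m}, samples stored column-wise (Eq. (6))
W1 = rng.standard_normal((4, 3))   # W[1]: rows = neurons in layer j-1, columns = neurons in layer j
b1 = np.zeros((3, 1))              # bias (threshold) of layer j
A1 = forward_layer(W1, b1, X)      # A[1] has shape (3, 5)
```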

3 Methodology

3.1 Deep Q Networks

Deep Q networks were developed from the need for an algorithm capable of tackling a broad variety of problems. They combine reinforcement learning concepts with artificial neural networks. The development of complex artificial-neural-network topologies, which use additional layers of neurons for more effective generalisation and pattern recognition, made it feasible for machines to learn categories from raw data. Deep Q networks primarily take tensors as input; hence, deep convolutional networks are used to identify the required patterns. At the model's input, the slices of the tensor are single-channel images.

Deep Q networks solve problems involving agent–environment interaction. In particular, the framework posits that the agent can observe a situation, choose an action, and receive a reward; the agent's objective is to choose the sequence of actions that accumulates the greatest possible amount of reward over time. Specifically, a deep convolutional neural network is employed to approximate the optimal action-value function shown in Eq. (12).

$$\begin{aligned} Q^{*}(s, a)=\max _{\pi } \mathbb {E}\left[ r_{t}+\gamma r_{t+1}+\gamma ^{2} r_{t+2}+\ldots \mid s_{t}=s, a_{t}=a, \pi \right] \end{aligned}$$
(12)

Equation (12) expresses the maximum sum of rewards \(r_{t}\) discounted by a factor \(\gamma \) at each training cycle t. This sum is achieved by following the behaviour defined by the policy function \(\pi =P(a \mid s)\), i.e., performing action a in state s.

Until the introduction of deep Q networks, neural networks, like other non-linear function approximators, were considered unsuitable for approximating the action-value function (also known as the Q function). The difficulties are considerable; for example, small changes in the Q function can produce large changes in the agent's policy. Deep Q networks use an experience replay mechanism that mitigates the correlation between successive samples in a time sequence by reducing the rate at which the data distribution changes. The second element of deep Q networks is the iterative update of the Q function towards a more accurate approximation of the optimal action-value function \(Q^{*}\).

To implement this mechanism, a tuple \(e_{t}=\left( s_{t}, a_{t}, r_{t}, s_{t+1}\right) \) is stored at each training cycle t, forming a data set \(D_{t}=\{e_{1}, \ldots , e_{t}\}\). During learning, Q-learning updates are applied to minibatches of experiences \((s, a, r, \hat{s}) \sim U(D)\), drawn uniformly at random from the set \(D_{t}\). To implement the iterative update mechanism, a second neural network with the same architecture as the model used to train the agent is introduced.

This second model, \(\hat{Q}\), is used to estimate the optimal action values. Every t training cycles, \(\hat{Q}\) is synchronised with the model Q. This arrangement stabilises the algorithm, as it absorbs the small but often unwanted changes in the agent's policy that would otherwise occur at the end of every training cycle. Thus, based on [5], the Q-learning update in an iteration i is derived from the cost function described in Eq. (13).

$$\begin{aligned} L_{i}\left( \theta _{i}\right) =\mathbb {E}_{(s, a, r, \hat{s}) \sim U(D)}\left[ \left( r+\gamma \max _{\hat{a}} Q\left( \hat{s}, \hat{a} ; \theta _{i}^{-}\right) -Q\left( s, a ; \theta _{i}\right) \right) ^{2}\right] \end{aligned}$$
(13)

In Eq. (13), \(\theta _{i}\) are the parameters of the neural network Q at iteration i and \(\theta _{i}^{-}\) are the parameters of the neural network \(\hat{Q}\) used to estimate the optimal action value at iteration i. Motion information of the agent in the environment is formed by stacking 4 consecutive single-channel frames. To avoid overtraining, besides converting each frame from colour to grayscale, the frame is also downscaled and cropped from 210 \(\times \) 160 to 84 \(\times \) 84. Thus, the variable s, which is the input of the convolutional neural network, is a tensor of size 4 \(\times \) 84 \(\times \) 84.
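
The loss of Eq. (13) can be sketched in PyTorch as follows; the network sizes, batch format, and the use of a small fully connected model in place of the paper's convolutional one are simplifying assumptions.

```python
import torch
import torch.nn as nn

GAMMA = 0.99

def dqn_loss(q_net, target_net, batch, gamma=GAMMA):
    """Squared TD error of Eq. (13) on a uniformly sampled minibatch (s, a, r, s', done).
    `q_net` holds the parameters theta_i; `target_net` holds the frozen parameters theta_i^-."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta_i)
    with torch.no_grad():
        max_q_next = target_net(s_next).max(dim=1).values       # max_a' Q(s', a'; theta_i^-)
        target = r + gamma * (1.0 - done) * max_q_next
    return nn.functional.mse_loss(q_sa, target)

# Illustrative usage with a tiny fully connected Q model over a flattened 4 x 84 x 84 input.
def make_q_model(n_actions=6):
    return nn.Sequential(nn.Flatten(), nn.Linear(4 * 84 * 84, 256), nn.ReLU(),
                         nn.Linear(256, n_actions))

q_net, target_net = make_q_model(), make_q_model()
target_net.load_state_dict(q_net.state_dict())  # periodic synchronisation of Q-hat with Q
```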

3.2 Double Deep Q Networks

One of the new concepts that appeared with deep Q networks was the use of a second neural network to estimate the optimal policy, operating in parallel with the neural network being trained. In this way, the agent is not affected by small changes in policy and the training process converges, since the Q model is synchronised with the \(\hat{Q}\) model only after a number t of training cycles. However, this can also lead to overestimation of the target, as the same maximisation operator is used for both selection and evaluation, as shown in Eq. (13).

The double deep Q network algorithm introduces a separate maximisation operator for the \(\hat{Q}\) model: there is a separate action-selection operator and a separate action-evaluation operator. The policy, modelled by the \(\epsilon \)-greedy method, is evaluated by the Q model, but the reward is estimated by the model \(\hat{Q}\). Based on [9], the policy update for the double deep Q network algorithm can be described as in Eq. (14).

$$\begin{aligned} Y_{t}^{\text{ Double } \text{ DQN } } \equiv R_{t+1}+\gamma Q\left( S_{t+1}, \underset{a}{{\text {argmax}}} Q\left( S_{t+1}, a ; \theta _{t}\right) , \theta ^{-}\right) \end{aligned}$$
(14)
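
A hedged sketch of the Double DQN target of Eq. (14): the online model Q selects the action via argmax and the target model \(\hat{Q}\) evaluates it. The tensor shapes and variable names are assumptions.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Y^DoubleDQN = R_{t+1} + gamma * Q(S_{t+1}, argmax_a Q(S_{t+1}, a; theta_t); theta^-)."""
    with torch.no_grad():
        best_action = q_net(s_next).argmax(dim=1, keepdim=True)        # action selection by Q
        q_eval = target_net(s_next).gather(1, best_action).squeeze(1)  # evaluation by Q-hat
        return r + gamma * (1.0 - done) * q_eval
```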

3.3 Learning with Multiple Training Cycles

The approach adopted for solving problems with RL methods uses the model of Markov decision processes. In this decision model, an agent interacts with an environment over a series of training cycles t. In each training cycle, the agent receives information about the environment state \(S_{t} \in \mathcal {S}\), where \(\mathcal {S}\) is the set of all possible states. The agent uses this information to choose an action \(A_{t}\) from the set of all possible actions \(\mathcal {A}\). Based on the agent's behaviour in state \(S_{t} \in \mathcal {S}\), a payoff \(R_{t+1} \in \mathbb {R}\) is computed and the agent moves to the next state \(S_{t+1} \in \mathcal {S}\) with state transition probability \(p(\hat{s} \mid s, a)=\Pr \left\{ S_{t+1}=\hat{s} \mid S_{t}=s, A_{t}=a\right\} \), for \(a \in \mathcal {A}\) and \(s, \hat{s} \in \mathcal {S}\). The behaviour of the agent is determined by the policy \(\pi (a \mid s)\) it follows, which is a probability distribution over the set \(\mathcal {S} \times \mathcal {A}\).

During the training of the agent, the optimal policy \(\pi ^{*}\) is the one that maximises the expected discounted total reward, as described in Eq. (15).

$$\begin{aligned} G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma ^{2} R_{t+3}+\cdots =\sum _{k=0}^{T-t-1} \gamma ^{k} R_{t+1+k} \end{aligned}$$
(15)

Algorithms following the temporal-difference method aim to maximise the quantity \(G_{t}\). The state-value function describes the expected return when the agent is in a state s and follows policy \(\pi \), as shown in Eq. (16).

$$\begin{aligned} v_{\pi }(s)=\mathbb {E}_{\pi }\left[ G_{t} \mid S_{t}=s\right] \end{aligned}$$
(16)

Central to the agent's training is also the action-value function, which is the expected payoff when the agent chooses an action a in a state s and thereafter follows policy \(\pi \), as shown in Eq. (17).

$$\begin{aligned} q_{\pi }(s, a)=\mathbb {E}_{\pi }\left[ G_{t} \mid S_{t}=s, A_{t}=a\right] \end{aligned}$$
(17)

Equation (17) can be computed iteratively, by observing new rewards and building on previous estimates of \(q_{\pi }\), using the update rule shown in Eq. (18).

$$\begin{aligned} Q\left( S_{t}, A_{t}\right) \leftarrow Q\left( S_{t}, A_{t}\right) +\alpha \left[ R_{t+1}+\gamma Q\left( S_{t+1}, A_{t+1}\right) -Q\left( S_{t}, A_{t}\right) \right] \end{aligned}$$
(18)

In Eq. (18), the constant \(\alpha \in (0,1]\) is the learning-rate (step-size) parameter. The temporal-difference method can be extended to more training cycles: by carefully choosing a parameter \(n>1\), improved results can be obtained when training an agent. The result is Eq. (19), which shows the update rule used for learning over multiple training cycles [3].

$$\begin{aligned} Q_{t+n}\left( S_{t}, A_{t}\right) \leftarrow Q_{t+n-1}\left( S_{t}, A_{t}\right) +\alpha \rho _{t+1}^{t+n}\left[ G_{t: t+n}-Q_{t+n-1}\left( S_{t}, A_{t}\right) \right] \end{aligned}$$
(19)

In Eq. (19), the estimated return \(G_{t: t+n}\) for an agent using learning n training cycles is given by Eq. (20).

$$\begin{aligned} G_{t: t+n}=\sum _{k=0}^{n-1} \gamma ^{k} R_{t+k+1}+\gamma ^{n} Q_{t+n-1}\left( S_{t+n}, A_{t+n}\right) \end{aligned}$$
(20)

In Eq. (19) and Eq. (20), the quantity \(Q_{t+n-1}\) is the estimate of the function \(q_{\pi }\) at time \(t+n-1\) and the subscript \(t: t+n\) denotes the span of the update. Also, in Eq. (19), the term \(\rho _{t+1}^{t+n}\) is the importance-sampling ratio, which weights the samples so that the most important ones are selected proportionally; it is described by Eq. (21).

$$\begin{aligned} \rho _{t}^{t+n} = \prod _{k=t}^{\tau } \frac{\pi \left( A_{k} \mid S_{k}\right) }{\mu \left( A_{k} \mid S_{k}\right) } \end{aligned}$$
(21)

In Eq. (21), the variable \(\tau =\min (t+n-1, T-1)\) is the last training cycle included in the update step.
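
The n-step return of Eq. (20) and the importance-sampling ratio of Eq. (21) can be sketched as plain Python functions; the argument formats are illustrative assumptions.

```python
def n_step_return(rewards, q_bootstrap, gamma=0.99):
    """G_{t:t+n} = sum_{k=0}^{n-1} gamma^k R_{t+k+1} + gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})  (Eq. (20)).
    `rewards` holds R_{t+1}, ..., R_{t+n}; `q_bootstrap` is Q_{t+n-1}(S_{t+n}, A_{t+n})."""
    n = len(rewards)
    return sum((gamma ** k) * r for k, r in enumerate(rewards)) + (gamma ** n) * q_bootstrap

def importance_ratio(pi_probs, mu_probs):
    """rho = prod_k pi(A_k | S_k) / mu(A_k | S_k)  (Eq. (21)),
    with pi the target policy and mu the behaviour policy."""
    ratio = 1.0
    for p, m in zip(pi_probs, mu_probs):
        ratio *= p / m
    return ratio

# Example: a 3-step return with bootstrap value 0.5, and matching policies (ratio = 1).
g = n_step_return([0.0, 0.0, 1.0], q_bootstrap=0.5)
rho = importance_ratio([0.9, 0.8, 0.7], [0.9, 0.8, 0.7])
```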

3.4 Rainbow Agent

The Rainbow agent is a combination of all the previous methods (Sects. 3.1, 3.2, and 3.3). First, the learning horizon is extended, as described in Sect. 3.3, using n training cycles for learning. The multi-cycle learning method is also combined with the double deep Q network method, using the action obtained by following the \(\epsilon \)-greedy method in the state \(S_{t+n}\). Finally, the model uses a dueling architecture, and noise from a NoisyNet model is introduced on the streams used to estimate advantage and value. The Rainbow agent hyperparameters, as they were optimised, are given in Table 1. The Rainbow agent [4] successfully combines all of the previous reinforcement learning techniques, as shown in Fig. 1.

Table 1. Hyperparameter table of a Rainbow agent.
Fig. 1. Comparison between the Rainbow agent and other reinforcement learning methods.

3.5 Problem Formulation

For the complex problem of autonomous navigation of unmanned aerial vehicles, a Rainbow agent was created [4]. Two different architectures were created at the level of the Q model. In the first case, the core used by the agent to extract the necessary patterns from the environment is a multilayer Perceptron neural network, while in the second the core is hybrid. The core is the model that processes the input data before they pass through the advantage and value streams of the overall Q model. The tasks of the agent are as follows:

  1. The agent is asked to move from a reference point \(\left\langle \begin{array}{lll}x_{0}&y_{0}&z_{0}\end{array}\right\rangle \) to a target point \(\left\langle \begin{array}{lll}x_{1}&y_{1}&z_{1}\end{array}\right\rangle \), with \(x_{0} \ne x_{1}\), \(y_{0} \ne y_{1}\) and \(z_{0} \ne z_{1}\).

  2. The agent shall not collide with any object during its transition to the target.

  3. The agent should try to follow the shortest known route.

To satisfy all three missions, the agent needs more information than the image provided by the depth camera alone. Therefore, the classical deep Q network architecture had to be adapted to the needs of the problem. As a result, instead of a single convolutional neural network, tests were conducted with two different cores: a) a multilayer Perceptron neural network, and b) a hybrid neural network consisting of a convolutional neural network core, a multilayer Perceptron neural network core, and a fully connected layer for combining the features from the two cores.

The idea behind the multilayer Perceptron neural network is that it can process the information it receives from the depth camera together with the necessary information about its own location in the environment and the location of the target. The state \(S_{t}\) of the agent at time t is a vector composed of: i) the preprocessed depth image \(D_{t}\), ii) the position vector of the agent \(P_{t}=\left\langle \begin{array}{lll}x_{t}&y_{t}&z_{t}\end{array}\right\rangle \), iii) the position vector of the agent \(P_{t-1}=\left\langle \begin{array}{lll}x_{t-1}&y_{t-1}&z_{t-1}\end{array}\right\rangle \) at the previous time step, iv) the position vector of the target \(\mathcal {T}=\left\langle x_{\mathcal {T}} \quad y_{\mathcal {T}} \quad z_{\mathcal {T}}\right\rangle \), and v) a floating-point scalar \(l_{t}\) that informs the agent how many actions it has performed up to time t. The agent could perform up to 4 times more actions than the optimal estimated number of actions; the optimal estimate was obtained by counting the steps needed if the agent always chooses the action that reduces its distance from the target.

Preprocessing of the depth image consists of a conversion to grayscale, followed by resizing to smaller dimensions and, finally, a centre crop of the result so that its height equals its width. Thus, the depth image is transformed from \(3 \times 210 \times 160\) to \(84 \times 84\). Finally, the depth image is normalised to the range [0, 1] by dividing it by the maximum possible value, which is 255. The result is the matrix \(D_{t}\).
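
A possible implementation of this preprocessing, sketched with OpenCV; the intermediate resize dimensions before the centre crop are not given in the paper, so the values used here are assumptions.

```python
import cv2
import numpy as np

def preprocess_depth(frame):
    """Convert an RGB depth frame of shape (210, 160, 3) to an 84 x 84 matrix D_t in [0, 1].
    The intermediate resize size (110 x 84) before the centre crop is an assumption."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)   # to grayscale
    small = cv2.resize(gray, (84, 110))              # cv2.resize takes (width, height)
    top = (small.shape[0] - 84) // 2
    square = small[top:top + 84, :84]                # centre crop to 84 x 84
    return square.astype(np.float32) / 255.0         # normalise by the maximum value 255

D_t = preprocess_depth(np.zeros((210, 160, 3), dtype=np.uint8))
```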

The matrix \(D_{t}\) is flattened into a column vector \(D_{t}^{fl}\) and then concatenated with the remaining input parameters \(P_{t}, P_{t-1}, \mathcal {T}\) and \(l_{t}\), forming the agent state \(S_{t}\) at time t. There are 6 available actions. In the experiments of this work, the displacement step was 1 m. The payoff at each step was given by Eq. (22).

$$\begin{aligned} R_{t}=\left\{ \begin{array}{ll} -100 &{} \text {in the event of a collision with an obstacle},\\ 100 &{} \text {upon completion of the mission},\\ -10 &{} \text {if }l_{t}\le 0,\\ -10 &{} \text {in case of early landing},\\ \mathcal {D}_{t-1}-\mathcal {D}_{t} &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(22)

In Eq. (22), the quantity \(\mathcal {D}_{t}\) is the distance of the agent from the target at time t, and \(\mathcal {D}_{t-1}\) is the corresponding distance at time \(t-1\). The distance is calculated according to Eq. (23).

$$\begin{aligned} \mathcal {D}_{t}=\left\| P_{t}-\mathcal {T}\right\| \end{aligned}$$
(23)
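
The per-step reward of Eq. (22), with the distance of Eq. (23) taken as the Euclidean norm, can be sketched as follows; the boolean flag names for collision, mission completion, and early landing are illustrative assumptions.

```python
import numpy as np

def distance_to_target(P_t, target):
    """D_t = || P_t - T ||  (Eq. (23), interpreted as the Euclidean distance)."""
    return float(np.linalg.norm(np.asarray(P_t) - np.asarray(target)))

def step_reward(collided, reached_goal, l_t, landed_early, d_prev, d_curr):
    """Reward of Eq. (22); the boolean flags are illustrative names."""
    if collided:
        return -100.0
    if reached_goal:
        return 100.0
    if l_t <= 0 or landed_early:
        return -10.0
    return d_prev - d_curr   # positive when the agent has moved closer to the target

r = step_reward(False, False, l_t=0.5, landed_early=False,
                d_prev=distance_to_target((0, 0, 0), (3, 4, 0)),
                d_curr=distance_to_target((1, 0, 0), (3, 4, 0)))
```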

4 Experimental Results

During the experiments, the performance of the agent was evaluated across 16 different targets. The AirSim software was used as the simulation interface. The results for the multilayer Perceptron neural network are shown in Table 2, and the training evaluation is shown in Fig. 2. The results for the hybrid neural network are shown in Fig. 3.

Table 2. Percentage of success/failure for action selection and mission completion.
Fig. 2. Results of the Rainbow agent in terms of average cost and average reward.

The agent with the multilayer Perceptron neural network core proved to work very satisfactorily. However, it was unable to make a decision when faced with an obstacle of large dimensions. To solve this problem, a hybrid architecture was tried. The input of the new Q model consists of the matrix \(D_{t}\) and the vector \(V_{t}\), which is the concatenation of \(P_{t}\), \(P_{t-1}\), \(\mathcal {T}\) and \(l_{t}\). The Q model consists of a deep convolutional neural network for efficient feature extraction from the matrix \(D_{t}\) and a deep multilayer Perceptron neural network for feature extraction from the vector \(V_{t}\). The features of the two cores are joined at a common layer, called the fusion layer. The fusion layer is the input layer of a deep multilayer Perceptron neural network, the fusion model.
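
A sketch of such a hybrid Q model in PyTorch, with a convolutional branch for \(D_{t}\), an MLP branch for \(V_{t}\), and a fusion MLP over the concatenated features; the layer sizes are assumptions, and the value and advantage streams of the full dueling model are omitted for brevity.

```python
import torch
import torch.nn as nn

class HybridQModel(nn.Module):
    """Hybrid core: a convolutional branch for the depth matrix D_t, an MLP branch for the
    vector V_t, and a fusion MLP over the concatenated features. Layer sizes are illustrative."""
    def __init__(self, n_actions=6, vec_dim=10):
        super().__init__()
        self.conv_core = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten())
        self.mlp_core = nn.Sequential(nn.Linear(vec_dim, 64), nn.ReLU())
        conv_out = 32 * 9 * 9   # flattened convolutional output for an 84 x 84 input
        self.fusion = nn.Sequential(nn.Linear(conv_out + 64, 256), nn.ReLU(),
                                    nn.Linear(256, n_actions))

    def forward(self, depth, vec):
        feats = torch.cat([self.conv_core(depth), self.mlp_core(vec)], dim=1)
        return self.fusion(feats)

q = HybridQModel()
out = q(torch.zeros(1, 1, 84, 84), torch.zeros(1, 10))   # shape (1, 6)
```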

The computational resources required to train the agent were much higher than those available. Notably, the maximum possible size of the experience replay memory was limited by the system memory available for the experiments: the maximum replay memory size was 50,000 samples. Also, for speed reasons, prioritised replay memory was implemented using a segment tree. The segment tree belongs to the family of binary trees, and therefore the agent, instead of 50,000 items, could only keep \(2^{\left\lfloor \log _{2} 50000\right\rfloor }=2^{15}=32768\) experiences in memory. As a result, the training of the model could not be completed in full.
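
A tiny sketch of the capacity restriction mentioned above: sizing the segment tree to the largest power of two not exceeding the nominal replay budget rounds 50,000 transitions down to \(2^{15}=32768\) slots.

```python
import math

def segment_tree_capacity(budget):
    """Largest power of two not exceeding the nominal replay budget."""
    return 2 ** int(math.floor(math.log2(budget)))

print(segment_tree_capacity(50_000))  # 32768
```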

Fig. 3. Training of the multilayer neural network for feature extraction from \(V_{t}\).

After the two cores had been trained, it remained to train the fusion model. Keeping the parameters of the two cores fixed, only the parameters of the fusion model and the parameters of the value and advantage streams were adjusted during training. As in the previous experiments, it should be stated that the available computing resources were extremely limited for the needs of this problem: the indicative number of training cycles, as can be seen in Fig. 1, is 44,000,000, whereas the model here was run for 450,000 training cycles.

5 Conclusions and Future Work

Autonomous UAV navigation is an emerging topic in robotics, where most environments are unknown. Techniques for automated navigation are therefore multiplying, including modern AI systems that can analyse huge volumes of data to uncover patterns leading to successful solutions. Using reinforcement learning, the software learns a function that maximises the reward within an environment. Rainbow-type agents were chosen because of their exceptional performance in settings with a limited set of actions. As noted, an agent with high performance and a near-optimal policy was built by using the preprocessed image as a tensor input for pattern recognition, allowing a deep convolutional neural network to be trained with more data. Moreover, a better strategy than the one used here for building the agent with a multilayer Perceptron neural network core may yet be established, as the available computing resources were insufficient for such tasks.

Future directions of this work include the construction of a model that takes a set of sequential images at its input, containing all the information the agent needs about the vehicle, the surroundings, and the goal. A convolutional model designed to detect many different classes within an image may perform better than the convolutional core used for the experimental evaluation. Finally, fine-tuning optimisations include the use of more sophisticated algorithmic choices alongside this work.