1 Introduction

1.1 Motivation

With the development of 5G technology, computation-intensive applications running on user equipment (UE), e.g., online gaming, VR/AR and telemedicine, are becoming increasingly popular. These mobile applications typically require substantial computing resources and incur high energy consumption. Nevertheless, current UEs generally have constrained computing resources and limited battery capacity. Mobile Cloud Computing (MCC) has emerged to augment the computing and storage capabilities of UEs [1] and to reduce their energy consumption by offloading computation to the cloud via mobile networks. However, cloud servers are often spatially distant from the UE, which may incur high transmission delay [2] and degrade the user experience. To reduce the backhaul link delay, mobile computing has recently shifted to a new computation paradigm, i.e., Mobile Edge Computing (MEC) [3]. MEC migrates cloud computing resources and services into closer proximity to the UE, thereby effectively reducing communication delay and energy consumption [4].

Early studies on MEC mainly focused on enhancing the performance of a MEC system in which computation services are provided by a base station at a fixed position [5,6,7]. For example, Tran et al. [5] proposed a convex optimization method for resource allocation and a low-complexity heuristic algorithm for task offloading in a multi-user multi-server MEC system. Zhao et al. [6] aimed at maximizing system utility with a Collaborative Computation Offloading and Resource Allocation Optimization (CCORAO) scheme in a cloud-assisted MEC system. To reduce the computation cost in a multi-user MIMO based MEC system, Chen et al. [7] proposed a deep reinforcement learning method to dynamically generate a continuous power allocation strategy. However, MEC services provided by fixed infrastructure cannot work effectively in scenarios where communication facilities are sparsely distributed or when sudden natural disasters occur.

Most recently, researchers have paid special attention to the Unmanned Aerial Vehicle (UAV) because of its high mobility and flexible deployment [8], and have studied resource allocation and path planning in UAV-assisted MEC systems [9,10,11,12,13,14]. Aiming at reducing the processing delay among UEs in a UAV-assisted MEC system, Hu et al. [9] developed an algorithm based on the penalty dual decomposition optimization framework, jointly optimizing UAV trajectory, computation offloading and user scheduling. Diao et al. [10] designed a joint UAV trajectory and task data allocation optimization algorithm for energy saving. Cheng et al. [11] proposed a new edge computing architecture called UAV-assisted space-air-ground integrated network (SAGIN), and designed an actor-critic based Reinforcement Learning (RL) algorithm for resource allocation and task scheduling. Considering the curse of dimensionality in the state space, Li et al. [12] adopted an RL algorithm to improve the throughput of UEs’ task migrations in a UAV-based MEC system. Xiong et al. [13] presented an optimization algorithm to reduce energy consumption by jointly optimizing offloading decisions, bit allocation and UAV trajectory. Selim et al. [14] proposed using Device-to-Device (D2D) communication as an additional option for auxiliary communication and computation offloading in a UAV-MEC system. Despite extensive studies and applications, many challenges remain in UAV-assisted MEC systems; for example, system performance is severely affected by the limited computation capability of UEs and by blockages [15] from environmental obstacles (e.g., trees or buildings), especially in urban areas. Thus, adaptive link selection and task offloading are crucial issues in the UAV-assisted MEC system.

1.2 Novelty and contribution

In this paper, we consider a MEC system consisting of a UAV mounted with a nano server and a number of UEs, where the communication conditions are dynamic and time-varying. Unlike the Deep Q Network (DQN) based algorithms proposed for discrete action spaces [12], this paper designs a new Deep Deterministic Policy Gradient (DDPG) based computation offloading algorithm, which effectively supports a continuous action space for task offloading and UAV mobility. The main contributions of this paper can be summarized as follows.

  • Considering time-varying channel status in the time-slotted UAV-assisted MEC system, we jointly optimize user scheduling, UAV mobility and resource allocation, and formulate the non-convex computation offloading problem as a Markov Decision Process (MDP) problem to minimize the processing delay.

  • The complexity of the system state is very high when we consider the MDP model. Besides, computation offloading decisions need to be made over a continuous action space. Thus, we propose a novel DDPG-based computation offloading approach. DDPG is an advanced reinforcement learning algorithm that uses an actor network to generate a deterministic action and a critic network to approximate the Q-value function [16]. In this paper, the DDPG algorithm is adopted to obtain the optimal policy for user scheduling, UAV mobility and resource allocation in our UAV-assisted MEC system.

  • The proposed DDPG algorithm is implemented on the TensorFlow platform. Simulation results with different system parameters demonstrate the effectiveness of this algorithm. Under different communication conditions, our algorithm consistently outperforms the baseline algorithms.

The rest of this paper is organized as follows. Section 2 presents the system models for our UAV-assisted MEC system, as well as the optimization problem. Section 3 briefly introduces the preliminaries on deep reinforcement learning. Section 4 proposes our DDPG-based computation offloading algorithm. Section 5 illustrates simulation results of our proposed algorithm, and compares it with other baseline algorithms. Finally, Sect. 6 concludes this work.

2 System model

We consider a UAV-assisted MEC system, which consists of K UEs and a UAV equipped with a nano MEC server. The whole system operates in discrete time with equal-length time slots [17, 18]. The UAV provides communication and computing services to all UEs, but only to one UE at a time [13]. Due to the constrained computing capability, the UE has to offload a portion of the computing tasks via wireless network to the MEC server at the UAV.

2.1 Communication model

The UAV provides computing services to all UEs in a time-division manner [13], in which the whole communication period T is divided into I time slots. We assume that the UEs move randomly in the area at a low speed. In each time slot, the UAV hovers in a fixed position, and then establishes communication with one of the UEs. After offloading a portion of its computing tasks to the server, the UE executes the remaining tasks locally. In the 3D Cartesian coordinate system [10], the UAV keeps flying at a fixed altitude H, where the UAV has a start coordinate \({\mathbf{q}}(i) = [x(i), \, y(i)]^{T} \in {\mathbf{R}}^{{{2} \times {1}}}\) and an end coordinate \({\mathbf{q}}(i + {1}) = [x(i + 1), \, y(i + 1)]^{T} \in {\mathbf{R}}^{{{2} \times {1}}}\) at time slot \(i \in \{ 1, \, 2, \, \ldots , \, I\}\). The coordinate of UE \(k \in \{ 1, \, 2, \, \ldots , \, K\}\) is \({\mathbf{p}}_{k} (i) = [x_{k} (i), \, y_{k} (i)]^{T} \in {\mathbf{R}}^{{{2} \times {1}}}\). The channel gain of the line-of-sight link between the UAV and the UE k can be expressed as [13]:

$$g_{k} (i) = \alpha_{0} d_{k}^{ - 2} (i) = \frac{{\alpha_{0} }}{{||{\mathbf{q}}(i + {1}) - {\mathbf{p}}_{k} (i)||^{2} + H^{2} }},$$
(1)

where \(\alpha_{0}\) denotes the channel gain at a reference distance \(d = 1{\text{ m}}\), and \(d_{k} (i)\) denotes the Euclidean distance between the UAV and UE k. Accounting for possible blockage by obstacles, the wireless transmission rate can be given as:

$$r_{k} (i) = B\log_{2} (1 + \frac{{P_{up} g_{k} (i)}}{{\sigma^{2} + f_{k} (i)P_{NLOS}^{{}} }}),$$
(2)

where B denotes the communication bandwidth, Pup denotes the transmit power of the UE on the uplink, σ2 denotes the noise power, PNLOS denotes the penetration loss caused by blockage [19], and \(f_{k} (i)\) indicates whether the link between the UAV and UE k is blocked at time slot i (i.e., 0 means no blockage and 1 means blockage) [15].
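For concreteness, the following Python sketch evaluates Eqs. (1)–(2) literally for one UAV–UE pair. It is an illustrative helper rather than the authors' code: all quantities are assumed to be given in linear (non-dB) units, and the function and variable names are our own.

```python
import math

def uplink_rate(q_next, p_k, H, f_k, alpha0, B, P_up, sigma2, P_nlos):
    """Compute the channel gain g_k(i) of Eq. (1) and the rate r_k(i) of Eq. (2).

    q_next, p_k : horizontal coordinates [x, y] of the UAV and UE k (metres)
    f_k         : blockage indicator, 0 (no blockage) or 1 (blockage)
    alpha0, P_up, sigma2, P_nlos : reference channel gain, transmit power,
                                   noise power and penetration loss (linear units)
    """
    d_sq = (q_next[0] - p_k[0]) ** 2 + (q_next[1] - p_k[1]) ** 2 + H ** 2
    g_k = alpha0 / d_sq                                              # Eq. (1)
    return B * math.log2(1 + P_up * g_k / (sigma2 + f_k * P_nlos))   # Eq. (2)
```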

2.2 Computation model

In our MEC system, a partial offloading strategy is used for the UE’s tasks in each time slot [20]. Let \(R_{k} (i) \in [0, \, 1]\) denote the ratio of tasks offloaded to the server, so that \((1 - R_{k} (i))\) is the fraction of tasks executed locally at the UE. Then, the local execution delay of UE k at time slot i can be given as:

$$t_{{local,{\text{ }}k}} (i) = \frac{{(1 - R_{k} (i))D_{k} (i)s}}{{f_{{UE}} }}$$
(3)

where Dk(i) denotes the computing task size of UE k, s denotes the number of CPU cycles required to process each bit, and fUE denotes the UE’s computing capability. In the i-th time slot, the UAV flies from position \({\mathbf{q}}(i)\) to the new hovering position \({\mathbf{q}}(i + 1) = [x(i) + v(i)t_{fly} \cos \beta (i), \, y(i) + v(i)t_{fly} \sin \beta (i)]^{T}\) at a speed \(v(i) \in [0, \, v_{\max } ]\) and an angle \(\beta (i) \in [0,{ 2}\pi ]\). The energy consumed by this flight can be expressed as:

$$E_{{fly}} (i) = \phi ||v(i)||^{2}$$
(4)

where \(\phi = 0.5M_{UAV} t_{fly}\) [9], \(M_{UAV}\) is related to the UAV’s payload, and tfly is the fixed flight time per slot. In the MEC system, the size of the computation results returned by the server is usually very small and negligible [9]; thus, the transmission delay of the downlink is not taken into consideration here. The processing delay at the MEC server can be divided into two parts. The first part is the transmission delay, which can be given as:

$$t_{tr, \, k} (i) = \frac{{R_{k} (i)D_{k} (i)}}{{r_{k} (i)}}.$$
(5)

The other part is the delay resulting from the computation at the MEC server, which can be given as:

$$t_{{UAV,{\text{ }}k}} (i) = \frac{{R_{k} (i)D_{k} (i)s}}{{f_{{UAV}} }}$$
(6)

where fUAV denotes the CPU frequency of the MEC server. Correspondingly, the energy consumed to offload the computing tasks to the server in time slot i can also be divided into two parts: one for transmission and the other for computation. When the computation is performed at the MEC server, its power consumption can be expressed as:

$$P_{UAV, \, k} (i) = \kappa f_{UAV}^{3} .$$
(7)

Then, the energy consumption of the MEC server at time slot i is given as:

$$E_{UAV, \, k} (i) = P_{UAV, \, k} (i)t_{UAV, \, k} (i) = \kappa f_{UAV}^{2} R_{k} (i)D_{k} (i)s.$$
(8)
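As a minimal sketch (not the authors' implementation), the per-slot delay and energy terms of Eqs. (3)–(8) can be collected in one helper; the name `slot_cost` and the grouping of outputs are our own, and `kappa` is the effective capacitance coefficient of Eq. (7).

```python
def slot_cost(R_k, D_k, s, f_ue, f_uav, r_k, v, t_fly, M_uav, kappa):
    """Per-slot delay and energy terms of Eqs. (3)-(8); all inputs in SI units.

    R_k : offloading ratio in [0, 1]    D_k : task size in bits
    s   : CPU cycles per bit            r_k : uplink rate from Eq. (2)
    """
    t_local = (1 - R_k) * D_k * s / f_ue           # Eq. (3): local execution delay
    phi = 0.5 * M_uav * t_fly
    e_fly = phi * v ** 2                           # Eq. (4): UAV flight energy
    t_tr = R_k * D_k / r_k                         # Eq. (5): uplink transmission delay
    t_uav = R_k * D_k * s / f_uav                  # Eq. (6): MEC-server computing delay
    e_uav = kappa * f_uav ** 2 * R_k * D_k * s     # Eqs. (7)-(8): server computing energy
    # Per-slot processing delay as used in objective (9a): local execution and the
    # offloaded path (transmission + edge computing) proceed in parallel.
    delay = max(t_local, t_tr + t_uav)
    return delay, e_fly, e_uav
```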

2.3 Problem formulation

Based on the models introduced above, we formulate the optimization problem in our UAV-assisted MEC system. To ensure efficient utilization of the limited computation resources, we aim at minimizing the maximum processing delay for all UEs, by jointly optimizing user scheduling, UAV mobility and resource allocation in the system. Specifically, the optimization problem can be denoted as:

$$\mathop {\min }\limits_{\{ \alpha_{k} (i), \, {\mathbf{q}}(i + 1), \, R_{k} (i)\} } \sum\limits_{i = 1}^{I} {\sum\limits_{k = 1}^{K} {\alpha_{k} (i)\max \{ t_{local, \, k} (i),t_{UAV, \, k} (i) + t_{tr, \, k} (i)\} } }$$
(9a)
$${\text{s.t. }}\alpha_{k} (i) \in \{ 0, \, 1\} , \, \forall i \in \{ 1, \, 2, \, \ldots , \, I\} , \, k \in \{ 1, \, 2, \, ..., \, K\} ,$$
(9b)
$$\sum\limits_{k = 1}^{K} {\alpha_{k} (i) = 1} , \, \forall i,$$
(9c)
$$0 \le R_{k} (i) \le 1, \, \forall i, \, k,$$
(9d)
$${\mathbf{q}}(i) \in \{ (x(i), \, y(i))|x(i) \in [0, \, L], \, y(i) \in [0, \, W]\} , \, \forall i,$$
(9e)
$${\mathbf{p}}_{k} (i) \in \{ (x_{k} (i), \, y_{k} (i))|x_{k} (i) \in [0, \, L], \, y_{k} (i) \in [0, \, W]\} , \, \forall i, \, k,$$
(9f)
$$f_{k} (i) \in \{ 0, \, 1\} , \, \forall i, \, k,$$
(9g)
$$\sum\limits_{i = 1}^{I} {(E_{fly, \, k} (i) + E_{UAV, \, k} (i)) \le E_{b} , \, \forall } k,$$
(9h)
$$\sum\limits_{i = 1}^{I} {\sum\limits_{k = 1}^{K} {\alpha_{k} (i)} D_{k} (i) = D, \, }$$
(9i)

where constraints (9b) and (9c) guarantee that only one user is scheduled for computing in time slot i. Constraint (9d) gives the range of the offloading ratio of the computing task. Constraints (9e) and (9f) restrict the UAV and the UEs to move only within the given area. Constraint (9g) represents the blockage state of the wireless channel between the UAV and the UE at time slot i. Constraint (9h) ensures that the energy consumed by UAV flight and computation over all time slots does not exceed the battery capacity. Constraint (9i) specifies that all computing tasks must be completed within the whole time period.

3 DDPG based computation offloading optimization

In this section, we first introduce fundamental knowledge of MDP, Q-Learning, DQN, and DDPG which are important emerging RL technologies. Then, we discuss how to utilize DDPG to train an efficient computation offloading policy in our UAV-assisted MEC system. In detail, we define the state space, the action space and the reward function, describe the state normalization for data preprocessing, and illustrate the procedures of training algorithm and testing algorithm.

3.1 A primer on RL

3.1.1 MDP

An MDP [21] is a mathematical framework for describing a discrete-time stochastic control process in which outcomes are partly random and partly under the control of an agent or decision maker. It formally describes a fully observable environment for reinforcement learning. Typically, an MDP is defined as a tuple \(({\mathcal{S}}, \, {\mathcal{A}}, \, p(\cdot, \, \cdot), \, r)\), where \({\mathcal{S}}\) is the state space, \({\mathcal{A}}\) is the action space, \(p(s_{i + 1} |s_{i} , \, a_{i} )\) is the transition probability from the current state \(s_{i} \in {\mathcal{S}}\) to the next state \(s_{i + 1} \in {\mathcal{S}}\) after action \(a_{i} \in {\mathcal{A}}\) is executed, and \(r \, : \, {\mathcal{S}} \times {\mathcal{A}} \to {\mathcal{R}}\) is the instantaneous/immediate reward function. We denote by \(\pi { : }{\mathcal{S}} \to {\mathcal{P}}{(}{\mathcal{A}}{)}\) a "policy", i.e., a mapping from states to actions. The MDP aims to find an optimal policy that maximizes the expected accumulated reward:

$$R_{i} = \sum\limits_{l = i}^{\infty } {\gamma^{l - i} r_{l} } ,$$
(10)

where \(\gamma \in [0, \, 1]\) is the discount factor, \(r_{l} = r(s_{l} , \, a_{l} )\) is the immediate reward at l-th time slot. Under policy \(\pi\), the expected discounted return from state \(s_{i}\) is defined as a state value function, i.e.,

$$V_{\pi } (s_{i} ) = {\mathbb{E}}_{\pi } [R_{i} |s_{i} ].$$
(11)

Similarly, the expected discounted return after taking action ai in state si under a policy \(\pi\) is defined as an action value function, i.e.,

$$Q_{\pi } (s_{i} , \, a_{i} ) = {\mathbb{E}}_{\pi } [R_{i} |s_{i} , \, a_{i} ].$$
(12)

Based on the Bellman equation, the recursive relationship of the state value function and the action value function can be expressed as, respectively,

$$V_{\pi } (s_{i} ) = {\mathbb{E}}_{\pi } [r(s_{i} , \, a_{i} ) + \gamma V_{\pi } (s_{i + 1} )],$$
(13)
$$Q_{\pi } (s_{i} , \, a_{i} ) = {\mathbb{E}}_{\pi } [r(s_{i} , \, a_{i} ) + \gamma Q_{\pi } (s_{i + 1} , \, a_{i + 1} )].$$
(14)

Since we aim to find the optimal policy \(\pi^{*}\), the optimal action in each state can be found through the optimal value functions. The optimal state value function can be expressed as:

$$V_{*} (s_{i} ) = \mathop {\max }\limits_{{a_{i} \in {\mathcal{A}}}} {\mathbb{E}}_{\pi } [r(s_{i} , \, a_{i} ) + \gamma V_{\pi } (s_{i + 1} )].$$
(15)

Since the optimal action value function also follows the optimal policy \(\pi^{*}\), we can write \(Q_{*}\) in terms of \(V_{*}\) as follows:

$$Q_{*} (s_{i} , \, a_{i} ) = {\mathbb{E}}_{\pi } [r(s_{i} , \, a_{i} ) + \gamma V_{*} (s_{i + 1} )].$$
(16)

3.1.2 Q-learning

RL is an important branch of machine learning, in which an agent tries to obtain the maximum return by interacting with a controlled environment and adjusting its actions based on previous experience. RL is often used to solve MDP optimization problems in which the underlying transition probability \(p(s_{i + 1} |s_{i} , \, a_{i} )\) is unknown or even non-stationary. Q-learning [21] is a popular and effective RL method; it is an off-policy Temporal Difference (TD) control algorithm. The Bellman optimality equation for the state-action function, i.e., the optimal Q function, can be expressed as:

$$Q_{*} (s_{i} , \, a_{i} ) = {\mathbb{E}}_{\pi } [r(s_{i} , \, a_{i} ) + \gamma \mathop {\max }\limits_{{a_{i + 1} }} Q_{*} (s_{i + 1} , \, a_{i + 1} )].$$
(17)

The optimal value of the Q function can be found by an iterative process. The agent learns from experience tuple \((s_{i} , \, a_{i} , \, r_{i} , \, s_{i + 1} )\) and the Q function can be updated at time step i as follows:

$$Q(s_{i} , \, a_{i} ) \leftarrow Q(s_{i} , \, a_{i} ) + \alpha [r(s_{i} , \, a_{i} ) + \gamma \mathop {\max }\limits_{{a_{i + 1} }} Q(s_{i + 1} , \, a_{i + 1} ) - Q(s_{i} , \, a_{i} )],$$
(18)

where α is the learning rate, \(r(s_{i} , \, a_{i} ) + \gamma \max_{{a_{i + 1} }} Q(s_{i + 1} , \, a_{i + 1} )\) is the predicted (target) Q value, and \(Q(s_{i} , \, a_{i} )\) is the current Q value. The difference between the predicted and current Q values is the TD error. With an appropriately chosen learning rate, the Q-learning algorithm converges [22].
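The tabular update of Eq. (18) can be written in a few lines. The sketch below is a generic Q-learning step with illustrative hyper-parameter values, not part of the proposed scheme.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step following Eq. (18); Q maps (state, action) to a value."""
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    td_error = td_target - Q[(s, a)]        # temporal-difference (TD) error
    Q[(s, a)] += alpha * td_error
    return Q

# Example usage with an initially empty table:
# Q = defaultdict(float)
# Q = q_learning_update(Q, s, a, r, s_next, actions)
```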

3.1.3 DQN

The Q-learning algorithm updates the Q value of each state-action pair by maintaining a look-up table, which is suitable for small state-action spaces. In practical systems, however, these spaces are usually very large: many states are rarely visited and their Q values are rarely updated, which leads to a long convergence time for the Q function. By combining deep neural networks (DNNs) with Q-learning, DQN [23] addresses this shortcoming. The core idea behind DQN is to use a DNN parameterized by θ, instead of a Q table, to approximate the Q value \(Q(s, \, a)\), i.e., \(Q(s, \, a|\theta ) \approx Q_{*} (s, \, a)\).

However, the stability of an RL algorithm using DNNs cannot be guaranteed, as stated in [23]. To solve this problem, two mechanisms are employed. The first is experience replay. At each time step i, the agent’s interaction experience tuple \((s_{i} , \, a_{i} , \, r_{i} , \, s_{i + 1} )\) is stored in the experience replay buffer, i.e., the experience pool Bm. Then, a small batch of samples, i.e., a mini-batch, is randomly selected from the experience pool to train the parameters of the deep neural network, instead of training directly on consecutive samples. The second stabilization mechanism is a target network that initially contains the weights of the network that sets the policy, but remains frozen for a fixed number of time steps. The target Q network is updated slowly, while the main Q network is updated frequently. In this way, the correlation between the target and the estimated Q value is significantly reduced, which stabilizes the algorithm.

In each iteration, the deep Q function is trained toward the target value by minimizing the loss function L(θ). The loss function can be written as:

$$L(\theta ) = {\mathbb{E}}[(y - Q(s, \, a|\theta ))^{2} ],$$
(19)

where the target value y is expressed as \(y = r + \gamma \max_{{a^{\prime}}} Q(s^{\prime}, \, a^{\prime}|\theta_{i}^{ - } )\). In Q-learning, the target weights are \(\theta_{i}^{ - } = \theta_{i - 1}\), whereas in deep Q-learning \(\theta_{i}^{ - } = \theta_{i - X}\), i.e., the target network weights are updated every X time steps.
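The following sketch shows how the target values and the loss of Eq. (19) are formed over a sampled mini-batch. Here `q_net` and `q_target_net` are assumed callables that return a vector of Q-values per state; this interface is introduced purely for illustration.

```python
import numpy as np

def dqn_targets(batch, q_target_net, gamma):
    """y = r + gamma * max_a' Q(s', a' | theta^-) for each transition in the mini-batch."""
    return np.array([r + gamma * np.max(q_target_net(s_next))
                     for (s, a, r, s_next) in batch])

def dqn_loss(batch, targets, q_net):
    """Mean-squared TD error L(theta) of Eq. (19)."""
    predictions = np.array([q_net(s)[a] for (s, a, r, s_next) in batch])
    return np.mean((targets - predictions) ** 2)
```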

3.1.4 DDPG

Although the DQN algorithm can handle high-dimensional state spaces, it still cannot deal with continuous action spaces. As a model-free off-policy actor-critic algorithm using DNNs, the DDPG algorithm [16] can learn policies in continuous action spaces. The actor-critic architecture is composed of a policy function and a Q-value function. The policy function acts as an actor to generate actions, while the Q-value function acts as a critic to evaluate the actor’s performance and guide the actor’s subsequent actions.

As shown in Fig. 1, DDPG uses two different DNNs to approximate the actor network \(\mu (s|\theta^{\mu } )\), i.e., the policy function, and the critic network \(Q(s, \, a|\theta^{Q} )\), i.e., the Q-value function. In addition, each of the actor and critic networks has a target network with the same structure: the actor target network \(\mu^{\prime}\) with parameters \(\theta^{{\mu^{\prime}}}\) and the critic target network \(Q^{\prime}\) with parameters \(\theta^{{Q^{\prime}}}\). Similar to DQN, the critic network \(Q(s,a|\theta^{Q} )\) can be updated by minimizing the loss:

$$L(\theta^{Q} ) = {\mathbb{E}}_{{\mu^{\prime}}} [(y_{i} - Q(s_{i} , \, a_{i} |\theta^{Q} ))^{2} ],$$
(20)
Fig. 1 Diagram of DDPG

where

$$y_{i} = r_{i} + \gamma Q^{\prime}(s_{i + 1} , \, \mu^{\prime}(s_{i + 1} |\theta^{{\mu^{\prime}}} )|\theta^{{Q^{\prime}}} ).$$
(21)

As proved by Silver et al. [24], the policy gradient can be computed by the chain rule:

$$\begin{gathered} \nabla_{{\theta^{\mu } }} J \approx {\mathbb{E}}_{{\mu^{\prime}}} [\nabla_{{\theta^{\mu } }} Q(s, \, a|\theta^{Q} )|_{{s = s_{i} , \, a = \mu (s_{i} |\theta^{\mu } )}} ] \hfill \\ { = }{\mathbb{E}}_{{\mu^{\prime}}} [\nabla_{a} Q(s, \, a|\theta^{Q} )|_{{s = s_{i} , \, a = \mu (s_{i} |\theta^{\mu } )}} \nabla_{{\theta^{\mu } }} \mu (s|\theta^{\mu } )|_{{s = s_{i} }} ]. \hfill \\ \end{gathered}$$
(22)

The entire training process of the DDPG algorithm can be summarized as follows. First, the actor network \(\mu\) outputs \(\mu (s_{i} )\) after the previous training step. To ensure adequate exploration of the state space, we need to balance the tradeoff between exploration and exploitation. Since DDPG is an off-policy algorithm, its exploration can be treated independently from the learning process. Therefore, we construct the exploration action by adding behavior noise \(n_{i}\), i.e., \(a_{i} = \mu (s_{i} ) + n_{i}\), where ni obeys the Gaussian distribution \(n_{i} \sim {\mathbb{N}}(\mu_{e} , \, \sigma_{e, \, i}^{2} )\), \(\mu_{e}\) is the mean and \(\sigma_{e, \, i}^{{}}\) is the standard deviation. After performing \(a_{i}\) in the environment, the agent observes the next state \(s_{i + 1}\) and the immediate reward \(r_{i}\). Then the transition \((s_{i} , \, a_{i} , \, r_{i} , \, s_{i + 1} )\) is stored in the experience replay buffer. After that, the algorithm randomly selects N transitions \((s_{j} , \, a_{j} , \, r_{j} , \, s_{j + 1} )\) from the buffer to form a mini-batch and feeds it into the actor and critic networks. With the mini-batch, the actor target network \(\mu^{\prime}\) outputs the action \(\mu^{\prime}(s_{j + 1} )\) to the critic target network \(Q^{\prime}\). With the mini-batch and \(\mu^{\prime}(s_{j + 1} )\), the target value \(y_{j}\) can then be calculated based on (21).

To minimize the loss function, the critic network \(Q\) is updated by a given optimizer, e.g., the Adam optimizer. Afterwards, the actor network \(\mu\) feeds the mini-batch actions \(a = \mu (s_{j} )\) to the critic network to obtain the action gradient \(\nabla_{a} Q(s, \, a|\theta^{Q} )|_{{s = s_{j} , \, a = \mu (s_{j} )}}\). The gradient \(\nabla_{{\theta^{\mu } }} \mu (s|\theta^{\mu } )|_{{s = s_{j} }}\) is computed by the actor's own optimizer. Using these two gradients, the actor network can be updated with the following approximation:

$$\nabla_{{\theta^{\mu } }} J \approx \frac{1}{N}\sum\limits_{j} {[\nabla_{a} Q(s, \, a|\theta^{Q} )|_{{s = s_{j} , \, a = \mu (s_{j} |\theta^{\mu } )}} \nabla_{{\theta^{\mu } }} \mu (s|\theta^{\mu } )|_{{s = s_{j} }} } ].$$
(23)

Finally, the DDPG agent uses a small constant \(\tau\) to update the critic target network and actor target network softly:

$$\theta^{{Q^{\prime}}} \leftarrow \tau \theta^{Q} + (1 - \tau )\theta^{{Q^{\prime}}} ,$$
(24)
$$\theta^{{\mu^{\prime}}} \leftarrow \tau \theta^{\mu } + (1 - \tau )\theta^{{\mu^{\prime}}} .$$
(25)
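Putting the exploration noise and the soft target update of Eqs. (24)–(25) into code gives the sketch below. Network parameters are represented as plain NumPy arrays purely for illustration, and `actor` is an assumed callable returning the deterministic action \(\mu(s)\).

```python
import numpy as np

def noisy_action(actor, state, mu_e, sigma_e, a_low, a_high):
    """Behaviour action a_i = mu(s_i) + n_i with Gaussian noise, clipped to the valid range."""
    a = np.asarray(actor(state), dtype=np.float64)
    a = a + np.random.normal(mu_e, sigma_e, size=a.shape)
    return np.clip(a, a_low, a_high)

def soft_update(online_params, target_params, tau):
    """Soft target update of Eqs. (24)-(25): theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```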

4 DDPG-based algorithm

4.1 State space

In the UAV-assisted MEC system, the state space is jointly determined by K UEs, a UAV and their environment. The system state at time slot i can be defined as:

$$s_{i} = (E_{battery} (i), \, {\mathbf{q}}(i), \, {\mathbf{p}}_{1} (i), \, ..., \, {\mathbf{p}}_{K} (i), \, D_{remain} (i), \, D_{1} (i), \, ..., \, D_{K} (i), \, f_{1} (i), \, ..., \, f_{K} (i)),$$
(26)

where Ebattery(i) denotes the remaining energy of the UAV battery in the i-th time slot, \({\mathbf{q}}(i)\) denotes the UAV location, \({\mathbf{p}}_{k} (i)\) denotes the location of UE k served by the UAV, Dremain(i) denotes the size of the remaining tasks that the system needs to complete in the whole time period, Dk(i) denotes the task size randomly generated by UE k in the i-th slot, and \(f_{k} (i)\) indicates whether the signal of UE k is blocked by obstacles. In particular, when \(i = 1\), \(E_{battery} (i) = E_{b}\) and \(D_{remain} (i) = D\).
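A direct way to assemble the state of Eq. (26) is to concatenate its components into one flat vector, as in the sketch below; the ordering mirrors Eq. (26), and the helper name is our own.

```python
import numpy as np

def build_state(e_battery, q_uav, p_ues, d_remain, d_ues, f_blocked):
    """Flat state vector s_i of Eq. (26).

    q_uav : UAV position [x, y]     p_ues : list of K UE positions [x_k, y_k]
    d_ues : task sizes D_k(i)       f_blocked : blockage indicators f_k(i)
    """
    return np.concatenate([
        [e_battery],            # remaining UAV battery energy
        q_uav,                  # UAV coordinates
        np.ravel(p_ues),        # UE coordinates
        [d_remain],             # remaining tasks in the whole period
        d_ues,                  # per-UE task sizes in this slot
        f_blocked,              # per-UE blockage indicators
    ]).astype(np.float32)
```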

4.2 Action space

Based on the current system state and the observed environment, the agent selects an action consisting of the UE \(k^{\prime}\) to be served, the UAV flight angle, the UAV flight speed and the task offloading ratio in time slot i. The action ai can be denoted as:

$$a_{i} = (k(i), \, \beta (i), \, v(i){, }R_{k} (i)).$$
(27)

It is worth noting that the actor network in DDPG outputs continuous actions. The UE-selection variable \(k(i) \in [0, \, K]\) output by the agent therefore needs to be discretized, i.e., if \(k(i) = {0}\), then \(k^{\prime} = {1}\); if \(k(i) \ne {0}\), then \(k^{\prime} = \left\lceil {k(i)} \right\rceil\), where \(\left\lceil \cdot \right\rceil\) is the ceiling operation. The flight angle, flight speed and task offloading ratio can be optimized directly in the continuous action space, i.e., \(\beta (i) \in [0,{ 2}\pi ]\), \(v(i) \in [0, \, v_{\max } ]\), and \(R_{k} (i) \in [0, \, 1]\). These four variables are jointly optimized to minimize the system cost.
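The mapping from the raw actor output to a feasible action can be sketched as follows; only the served-UE index is discretized, and the boundary guard is an implementation detail we add for robustness.

```python
import math

def decode_action(raw_action, K):
    """Map the actor output (k(i), beta(i), v(i), R_k(i)) of Eq. (27) to a feasible action."""
    k_raw, beta, v, R = raw_action
    k_served = 1 if k_raw == 0 else math.ceil(k_raw)   # ceiling rule described above
    k_served = min(max(k_served, 1), K)                # keep k' in {1, ..., K}
    return k_served, beta, v, R
```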

4.3 Reward function

The agent’s behavior is reward-driven, and the choice of an appropriate reward function plays a vital role in the performance of the DDPG framework. We aim to maximize the reward by minimizing the processing delay defined in problem (9), as follows:

$$r_{i} = r(s_{i} , \, a_{i} ) = - \tau_{delay} (i),$$
(28)

where the processing delay at time slot i is \(\tau_{delay} (i) = \sum\nolimits_{k = 1}^{K} {\alpha_{k} (i)\max \{ t_{local, \, k} (i), \, t_{UAV, \, k} (i) + t_{tr, \, k} (i)\} }\), with \(\alpha_{k} (i) = 1\) if \(k = k^{\prime}\) and \(\alpha_{k} (i) = {0}\) otherwise. With the DDPG algorithm, the action that maximizes the Q value can be found. The long-term expected reward of the system can be expressed using the Bellman equation as follows:

$$Q_{\mu } (s_{i} , \, a_{i} ) = {\mathbb{E}}_{\mu } [r(s_{i} , \, a_{i} ) + \gamma Q_{\mu } (s_{i + 1} , \, \mu {(}s_{i + 1} ))].$$
(29)
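With the per-UE delays already computed (e.g., via a helper such as the `slot_cost` sketch in Sect. 2.2), the reward of Eq. (28) reduces to a one-liner; `delays_per_ue` and `served_ue` are illustrative names.

```python
def reward(delays_per_ue, served_ue):
    """Immediate reward r_i = -tau_delay(i) of Eq. (28).

    delays_per_ue[k] holds max{t_local,k, t_UAV,k + t_tr,k} for UE k;
    only the scheduled UE (alpha_k(i) = 1) contributes to the delay.
    """
    return -delays_per_ue[served_ue]
```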

4.4 State normalization

During DNN training, the distribution of each layer’s inputs changes as the parameters of the previous layers change, which slows down training by requiring lower learning rates and careful parameter initialization. Ioffe and Szegedy proposed a batch normalization mechanism that allows training with higher learning rates and less careful initialization. Analogously, we propose a state normalization algorithm to preprocess the observed states so that the DNNs can be trained more effectively. It is worth noting that, different from the state normalization algorithm of Qiu et al. [25], the proposed algorithm takes the difference between the maximum and minimum of each variable as the scaling factor. The proposed state normalization algorithm thus resolves the differences in magnitude among the input variables.

In our work, the variables \(E_{battery} (i), \, {\mathbf{q}}(i), \, {\mathbf{p}}_{1} (i), \, ..., \, {\mathbf{p}}_{K} (i), \, D_{remain} (i), \, D_{1} (i), \, ..., \, D_{K - 1} (i)\) and \(D_{K} (i)\) in the state set lie in very different ranges, which may cause problems during training. As shown in Algorithm 1, these variables are normalized by state normalization to prevent such problems. We use five scaling factors in the state normalization algorithm. The scaling factor \(\gamma_{b}\) is used to scale down the UAV battery capacity. Since the UAV and the UEs share the same x and y coordinate ranges, we use \(\gamma_{x}\) and \(\gamma_{y}\) to scale down the x and y coordinates of the UAV and UEs, respectively. We use \(\gamma_{{D_{rm} }}\) to scale down the remaining tasks in the whole time period, and \(\gamma_{{D_{UE} }}\) to scale down the task size of each UE in time slot i.
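A minimal sketch of the normalization step is given below, assuming the state layout of Eq. (26) and the five scaling factors introduced above; the exact indexing is ours and may differ from Algorithm 1.

```python
import numpy as np

def normalize_state(state, K, gamma_b, gamma_x, gamma_y, gamma_Drm, gamma_Due):
    """Scale each group of state variables so that all DNN inputs have similar magnitudes.

    Layout (Eq. (26)): [E_battery, x, y, x_1, y_1, ..., x_K, y_K,
                        D_remain, D_1, ..., D_K, f_1, ..., f_K].
    """
    s = np.array(state, dtype=np.float32)
    s[0] /= gamma_b                        # remaining UAV battery energy
    s[1:3 + 2 * K:2] /= gamma_x            # x coordinates of UAV and all UEs
    s[2:3 + 2 * K:2] /= gamma_y            # y coordinates of UAV and all UEs
    s[3 + 2 * K] /= gamma_Drm              # remaining tasks in the whole period
    s[4 + 2 * K:4 + 3 * K] /= gamma_Due    # per-UE task sizes D_k(i)
    return s                               # blockage indicators are already in {0, 1}
```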

4.5 Training and testing

The DDPG-based computation offloading algorithm involves two phases: training and testing. The DDPG-based training algorithm for computation offloading is shown in Algorithm 2. In the training process, the critic network parameters and the actor network parameters of the behavior policy are iteratively updated. Algorithm 3 describes the computation offloading testing process, which uses the actor network \(\theta^{\mu }\) trained in Algorithm 2. Note that since the actor network is trained with normalized states, the input states must also be preprocessed in the testing process.

Algorithm 1 State normalization
Algorithm 2 DDPG-based training algorithm for computation offloading
Algorithm 3 DDPG-based testing algorithm for computation offloading
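To make the training procedure concrete, the following skeleton outlines one possible structure of Algorithm 2 under our assumptions: `env` follows a reset/step interface, `actor` and `critic` expose update and soft-update methods implementing Eqs. (20)–(25), and `normalize` is the state normalization of Algorithm 1. The hyper-parameter defaults are illustrative only.

```python
import random
from collections import deque
import numpy as np

def train_ddpg(env, actor, critic, normalize, episodes, steps_per_episode,
               batch_size=64, buffer_size=100_000, sigma_e=0.01, gamma=0.001, tau=0.01):
    """Skeleton of the DDPG-based training process (cf. Algorithm 2)."""
    buffer = deque(maxlen=buffer_size)
    for episode in range(episodes):
        s = normalize(env.reset())
        for i in range(steps_per_episode):
            mu = np.asarray(actor(s))
            a = mu + np.random.normal(0.0, sigma_e, size=mu.shape)   # behaviour noise
            s_next, r, done = env.step(a)                 # observe reward and next state
            s_next = normalize(s_next)
            buffer.append((s, a, r, s_next))              # experience replay buffer
            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size) # random mini-batch
                critic.update(batch, actor, gamma)        # minimise the loss of Eq. (20)
                actor.update(batch, critic)               # policy gradient of Eq. (23)
                critic.soft_update(tau)                   # Eq. (24)
                actor.soft_update(tau)                    # Eq. (25)
            s = s_next
            if done:
                break
    return actor
```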

5 Results and analysis

In this section, we evaluate the proposed DDPG-based framework for computation offloading in the UAV-assisted MEC system through numerical simulations. First, the simulation parameter settings are introduced. Then, the performance of the DDPG-based framework is validated under different scenarios and compared with other baseline schemes.

5.1 Simulation setting

In the UAV-assisted MEC system, we consider a 2D square area with \(K = 4\) UEs randomly distributed in \(L \times W = {100 } \times \, 100{\text{ m}}^{{2}}\) [10], and assume that the UAV flies at a fixed height \(H = 100{\text{ m}}\) [11]. As defined in [26], the gross mass of the UAV is \(M_{UAV} = 9.65{\text{ kg}}\). The whole time period \(T = 400{\text{ s}}\) is divided into I = 40 time slots. Referring to [9], the maximum UAV flight speed is \(v_{\max } = {\text{50 m/s}}\) and the flight time of the UAV is \(t_{fly} = 1{\text{ s}}\) in each time slot. The channel power gain at a reference distance of 1 m is set to \(\alpha_{{0}} = - {5}0{\text{ dB}}\) [9]. The transmission bandwidth is set to \(B = 1{\text{ MHz}}\). The noise power of the receiver is assumed to be \(\sigma^{2} = - 100{\text{ dBm}}\) [9] without signal blockage. If the signal is blocked during transmission between the UAV and UE k, i.e., \(f_{k} (i) = 1\), the signal penetration loss is \(P_{NLOS} = 20{\text{ dB}}\) [19]. We assume that the transmission power of UEs is \(P_{up} = 0.1{\text{ W}}\) [22], the UAV battery capacity is \(E_{b} = 500{\text{ kJ}}\) [26] and the required CPU cycles per bit are \(s = 1000{\text{ cycles/bit}}\) [9]. The computing capabilities of the UE and the MEC server are set to \(f_{UE} = 0.6{\text{ GHz}}\) and \(f_{UAV} = 1.2{\text{ GHz}}\) [9], respectively. The scaling factors in the proposed state normalization algorithm are set to \(\gamma_{b} = 5 \times 10^{5}\), \(\gamma_{x} = 100\), \(\gamma_{y} = 100\), \(\gamma_{{D_{rm} }} = 1.05 \times 10^{8}\) and \(\gamma_{{D_{UE} }} { = 2}{\text{.62}} \times {10}^{{6}}\), respectively. The detailed simulation parameters are listed in Table 1, unless otherwise specified. In our experiments, the average reward over multiple runs of Algorithm 3 with the same settings is used for performance comparison.

Table 1 Simulation parameters

For comparison, four baseline approaches are described as follows:

  • Offloading all tasks to UAV (Offload-only): During each time slot, the UAV provides computing services to the UE at a fixed location in the area center. The UE offloads all its computing tasks to the MEC server on UAV.

  • Executing all tasks locally (Local-only): Without the assistance of UAV, all computing tasks of UE are executed locally.

  • Actor Critic-based computation offloading algorithm (AC): To evaluate the performance of the proposed DDPG-based computation offloading algorithm, a continuous action space based RL algorithm, i.e., AC [11], is also implemented for the computation offloading problem. For a fair comparison with DDPG, AC also applies state normalization.

  • DQN-based computation offloading algorithm (DQN): The traditional discrete action space based DQN [12] algorithm is compared with the proposed DDPG-based algorithm. During the UAV flight, the angle levels are defined as \({\mathcal{B}} = \{ 0, \, \pi {/5}, \, \ldots {, 2}\pi \}\), the speed levels as \({\mathcal{V}} = \{ 0, \, v_{\max } /10, \, \ldots {, }v_{\max } \}\), and the offloading ratio levels as \({\mathcal{O}} = \{ 0, \, 0.1, \, \ldots {, }1.0\}\). For a fair comparison with DDPG and AC, DQN also applies state normalization; the resulting discrete action set is sketched below.
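For reference, the discrete action grid used by the DQN baseline can be enumerated as follows; including the served-UE index as one of the discretized dimensions is our assumption.

```python
import itertools
import math

def dqn_action_grid(K, v_max):
    """Discrete actions (served UE, flight angle, flight speed, offloading ratio)."""
    ues = range(1, K + 1)
    angles = [n * math.pi / 5 for n in range(11)]   # {0, pi/5, ..., 2*pi}
    speeds = [n * v_max / 10 for n in range(11)]    # {0, v_max/10, ..., v_max}
    ratios = [n / 10 for n in range(11)]            # {0, 0.1, ..., 1.0}
    return list(itertools.product(ues, angles, speeds, ratios))
```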

5.2 Simulation results and discussions

5.2.1 Parametric analysis

We first conduct a series of experiments to determine appropriate values for the important hyper-parameters used in the algorithm comparisons. The convergence performance of the proposed algorithm with different learning rates is shown in Fig. 2. We allow the learning rates of the critic network and the actor network to differ. First, we observe that when \(\alpha_{Actor} = 0.1\), \(\alpha_{Critic} = 0.2\) or \(\alpha_{Actor} = 0.001\), \(\alpha_{Critic} = 0.002\), the proposed algorithm converges. However, with \(\alpha_{Actor} = 0.1\) and \(\alpha_{Critic} = 0.2\), it converges to a local optimum, because large learning rates make both the critic and actor networks take large update steps. Second, when the learning rates are very small, i.e., \(\alpha_{Actor} = 0.00001\) and \(\alpha_{Critic} = 0.00002\), the algorithm cannot converge, because lower learning rates slow down the DNN updates and require more episodes to converge. Thus, the best learning rates of the actor and critic networks are \(\alpha_{Actor} = 0.001\) and \(\alpha_{Critic} = 0.002\), respectively.

Fig. 2
figure 2

Convergence performance of the DDPG-based algorithm under different learning rates

In Fig. 3, we compare the impact of different discount factors \(\gamma\) on the convergence performance of the proposed algorithm. It is observed that when the discount factor \(\gamma = 0.001\), the trained computation offloading policy performs best. The reason is that the environment varies greatly between periods, so the data of a single time period cannot fully represent long-term behavior. A larger \(\gamma\) makes the Q function treat the data collected in the whole time period as long-term data, which leads to poor generalization across different time periods. Therefore, an appropriate value of \(\gamma\) improves the ultimate performance of the trained policies, and we set the discount factor \(\gamma\) to 0.001 in the following experiments.

Fig. 3
figure 3

Convergence performance with different discount factors

Figure 4 shows the performance of the proposed algorithm in terms of processing delay under different exploration parameters \(\sigma_{e}\). The convergence performance of the proposed algorithm is greatly affected by this exploration parameter. When the algorithm converges at \(\sigma_{e} = 0.1\), the optimal delay fluctuates around 63 s. A larger \(\sigma_{e}\) produces a wider random noise distribution, which allows the agent to explore a larger region of the action space. When \(\sigma_{e} = 0.001\), the performance of the algorithm degrades at around 850 episodes, as the algorithm falls into a local optimum due to the small \(\sigma_{e}\). Therefore, a large number of experiments are needed to obtain an appropriate exploration setting in the UAV-assisted scenario. Hence, we choose \(\sigma_{e} = 0.01\) for better performance in the following experiments.

Fig. 4
figure 4

Performance of the delay under different exploration parameter settings

Figure 5 shows the influence of training strategies that omit state normalization or behavior noise in the DDPG training algorithm. On the one hand, if the DDPG algorithm is trained without behavior noise, its convergence is slower. On the other hand, if the policy is trained without state normalization, i.e., without introducing the scaling factors, the training algorithm fails. The reason is that, without state normalization, the values of Ebattery(i), Dremain(i) and Dk(i) are too large, which drives the randomly initialized DNNs to output large values. Thus, without the proposed state normalization strategy, the DDPG algorithm eventually degenerates into a greedy algorithm.

Fig. 5
figure 5

Performance without state normalization or behavior noise

5.2.2 Performance comparison

Figure 6 shows the performance comparison between different algorithms. In Fig. 6a, we trained the DNNs of the RL algorithms for a total of 1,000 episodes. It can be observed that the AC algorithm does not converge as the number of iterations increases, while both the DQN and DDPG algorithms converge. This is because the AC algorithm suffers from the simultaneous updating of the actor and critic networks: the action selection of the actor network depends on the value function of the critic network, but the critic network itself is difficult to converge. Therefore, the AC algorithm may not converge in some cases. In contrast, both DQN and DDPG benefit from the dual network structure of an evaluation network and a target network, which cuts off the correlation between training data and helps find the optimal action strategy. Using the delay results obtained after convergence, we compare these algorithms under different task sizes in Fig. 6b. For the same task size, the delay of the DDPG algorithm is always the lowest among the five algorithms. Because it explores a discrete action space with non-negligible gaps between the available actions, DQN cannot accurately find the optimal offloading strategy and converges to a processing delay much higher than that of DDPG. The DDPG algorithm, in contrast, explores a continuous action space and takes precise actions, which finally yields the optimal strategy and significantly reduces the delay. The Offload-only and Local-only algorithms cannot fully utilize the computing resources of the whole system; thus, for the same task size, the processing delay of the DDPG algorithm is significantly lower than those of Offload-only and Local-only. Moreover, as the task size increases, the processing delay optimized by the DDPG algorithm grows much more slowly than those of Offload-only and Local-only, which indicates the advantage of our proposed algorithm.

Fig. 6
figure 6

(a) Performance of different algorithms under task sizes D = 100 Mb. (b) Delay with different task sizes

Figure 7a and b show the delay and offloading ratio of the same group of experiments. Figure 7a shows the convergence performance of the DQN and DDPG schemes with different UE computing capabilities. The AC scheme is not included in this comparison because it does not converge. We find that when the computing capability of the UE is small, i.e., \(f_{UE} = 0.{\text{4 GHz}}\), the processing delay optimized by the two schemes is higher than that with \(f_{UE} = 0.{\text{6 GHz}}\). On the other hand, when the computing capability of the UE is large, the average offloading ratio in the system is small according to Fig. 7b, i.e., the UE prefers to execute tasks locally. The smaller the computing capability of the UE, the slower the data processing speed of the system within the same time, resulting in a larger maximum delay between local execution and offloading. Figure 7c shows the optimized delay comparison between the proposed scheme and the DQN scheme under different CPU frequencies. It can be observed from Fig. 7c that the proposed DDPG scheme achieves lower delay than the DQN scheme for all UE computing capabilities. The reason is that the DDPG scheme outputs continuous actions, rather than the limited discrete set of actions in DQN; thus, DDPG can precisely tune the factor that most strongly affects the delay in the continuous action space, i.e., the offloading ratio.

Fig. 7
figure 7

(a) Performance of DQN and DDPG under different computing capabilities of UE. (b) Offloading ratio of DQN and DDPG under different computing capabilities of UE. (c) Comparison between DQN and DDPG

In Fig. 8, we compare the average processing delay of the proposed DDPG scheme, DQN, Offload-only and Local-only as the number of UEs varies from 1 to 10. We assume that the total task size to be completed in a time period is the same for different numbers of UEs. As shown in Fig. 8, the average processing delay of all schemes except DQN is almost constant as the number of UEs increases, while the processing delay of the DQN scheme fluctuates around 86 s. The reason is as follows. The value ranges of the DQN output actions vary greatly with the number of UEs, so when such samples are used as DNN training inputs, the DNNs may tend to output larger values. The actor network of DDPG outputs multi-dimensional actions that keep the DNN inputs within the same range, i.e., [0, 1], which ensures the convergence and stability of the DDPG algorithm. In addition, the proposed DDPG scheme achieves the lowest delay, because it can find the optimal value in the continuous action space and obtain the optimal control policy.

Fig. 8
figure 8

Comparison of processing delay as the number of UEs varies from 1 to 10

6 Conclusion

In this paper, we study computation offloading in a UAV-assisted MEC system, where a flying UAV provides communication and computing services to UEs. We aim to minimize the sum of the maximum processing delays over the whole time period by jointly optimizing user scheduling, task offloading ratio, UAV flight angle and flight speed. To solve this non-convex optimization problem with discrete variables, we introduce the DDPG training algorithm to obtain the optimal offloading strategy. We describe the relevant RL background and the DDPG training process, and then propose a state normalization algorithm to make the DDPG algorithm converge more easily. We analyze the parameters of the DDPG algorithm through simulation and compare the impacts of different settings, including the learning rate, discount factor and training strategy. The simulation results show that, compared with the baseline algorithms, our proposed algorithm achieves better performance in terms of processing delay. For future work, we will investigate the performance of our approach against other popular RL algorithms, e.g., ACER, ACKTR, PPO2 and TRPO [21, 27].