1 Introduction

Reliable and timely sensory information from ground sensor nodes (SNs) is critical to applications in the Internet of Things (IoT). It is generally challenging for SNs with limited battery capacity to communicate reliably over long distances. In recent years, unmanned aerial vehicles (UAVs) have been routinely used as mobile data collectors in IoT due to their high mobility and easy deployment. The age of information (AoI) is a measure of the freshness of the SNs’ information. In [1], AoI was defined as the time elapsed from the generation of the latest packet by a source node to its reception by a target node. For AoI-oriented UAV data collection, [2] designed an online UAV flight trajectory based on deep reinforcement learning (DRL) to minimize the weighted sum of the SNs’ AoIs, and [3] studied the influence of the SNs’ sampling and queueing processes on the SNs’ average AoI. The above works only account for the line-of-sight (LoS) channel between the UAV and the SN.

In the urban environment, however, the LoS link between the UAV and the SN is likely to be blocked by obstructions such as tall buildings. The intelligent reflecting surface (IRS) is one of the technologies that have great potential in future wireless networks. It is a planar surface composed of a large number of low-cost passive reflective elements, each of which can independently adjust the phase of an incoming signal. This allows for intelligently reconfiguring the wireless propagation environment and improving the channel quality [4]. An IRS is also easy to install. As a result, an IRS can be used to overcome the channel blockage between the UAV and the SNs. Moreover, with an IRS, other aspects of the communication system, e.g., the UAV energy consumption [5] and the network throughput [6], can be improved.

In contrast to the above works, we consider the deployment of an IRS for UAV-assisted data collection in the IoT. We assume the energy of the UAV is limited and there is no charging station. For either periodic or random sampling of the SNs, we aim to minimize the SNs’ average AoI by jointly optimizing the UAV flight trajectory, the SNs’ scheduling and the IRS phase shift matrix. This is modeled as a finite Markov decision process (MDP). To solve this problem, we propose to use the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm [7] in DRL to learn the optimal policy for the flight trajectory and node scheduling. For a scheduled transmission from an SN to the UAV, the IRS is operated, based on the channel information, in such a way that the phases of the direct and reflected signals are aligned. Simulation results demonstrate that with the IRS and the learned policy, both the average AoI and the transmission power of the SNs can be significantly reduced.

2 System Model

As shown in Fig. 1, we consider an IoT network, where I SNs are distributed in a rectangular area with side lengths \(x_{max}^{U}\) and \(y_{max}^{U}\) to sample the environment and a rotary-wing UAV acts as a mobile base station to collect status-update information. The horizontal location of the i-th SN is expressed as \(q_{i}=\left[ x_{i},y_{i}\right] \in \mathbb {R}^{2}\left( i\in \mathcal {I}=\left\{ 1,\ldots ,I\right\} \right) \). However, obstructions such as tall buildings and trees in congested cities cause severe path loss and high attenuation to the air-to-ground channels between the UAV and SNs. Therefore, we deploy an IRS on a high-rise building at height \(H_{R}\) to improve channel quality by reflecting signals controllably. The horizontal location of the IRS is defined as \(q_{R}=\left[ x_{R},y_{R}\right] \in \mathbb {R}^{2}\). For simplicity, we assume a time-slotted system where the length of each time slot is \(T_{ts}\) seconds. The UAV flies at height \(H_{U}\) over the rectangular target area. The horizontal location of the UAV at time slot t is denoted by \(q_{t}^{U}=\left[ x_{t}^{U},y_{t}^{U}\right] \in \mathbb {R}^{2}\left( t\in \mathcal {T}=\left\{ 1,\ldots ,T\right\} \right) \). Furthermore, \(T\in \mathbb {N}\) depends on the UAV’s maximum carried energy \(e_{max}\) and the service process.

Fig. 1. IRS-assisted UAV data collection

2.1 Channel Model

The ground-to-air communication channel between each SN and the UAV includes two links: the direct link from the SN to the UAV, and the indirect link reflected by the IRS. The distance between the UAV and SN i at slot t is given by \(d_{t}^{i,U}=\sqrt{||q_{t}^{U}-q_{i}||^{2}+\left( H_{U}\right) ^{2}}\). Similarly, the distances between the SN i and IRS and between the IRS and UAV are denoted as \(d_{i,R}\) and \(d_{t}^{R,U}\), respectively. Assume that the IRS is composed of an \(M\times M\) uniform planar array (UPA) with \(M^{2}\) reflective elements. The set of reflective elements is defined as \(\mathcal {M}=\left\{ 1,...,M^{2}\right\} \).

1) Direct link: According to the channel model in [8], the direct link is either a LoS link or a non-line-of-sight (NLoS) link, each occurring with a certain probability. Then, the channel gain between the UAV and SN i at slot t is given by

$$\begin{aligned} h_{t}^{i,U}={\left\{ \begin{array}{ll} \sqrt{\frac{\beta _{0}}{\left( d_{t}^{i,U}\right) ^{\alpha _{1}}}}\widetilde{h}_{i,t}, &{} \text {LoS}\ \text {link}\\ \sqrt{\nu \frac{\beta _{0}}{\left( d_{t}^{i,U}\right) ^{\alpha _{1}}}}\widetilde{h}_{i,t}, &{} \text {NLoS}\ \text {link} \end{array}\right. }, \end{aligned}$$
(1)

where \(\beta _{0}\) is the path loss at the reference distance of 1 m, \(\alpha _{1}\) is the path loss exponent for the direct link, \(\nu <1\) denotes the attenuation factor due to NLoS, and \(\widetilde{h}_{i,t}\) is the small-scale fading that follows the complex Gaussian distribution with mean 0 and variance 1.

2) Indirect link: The channel between SN i and the IRS at slot t follows Rician fading. The channel gain \(\boldsymbol{h}_{t}^{i,R}\in \mathbb {C}^{M^{2}\times 1}\) can be expressed as \(\boldsymbol{h}_{t}^{i,R}=\sqrt{\frac{\beta _{0}}{\left( d_{i,R}\right) ^{\alpha _{2}}}}\left( \sqrt{\frac{k}{1+k}}\boldsymbol{h}_{LoS}^{i,R}+\sqrt{\frac{1}{1+k}}\boldsymbol{h}_{t,NLoS}^{i,R}\right) \), where k is the Rician factor, \(\alpha _{2}\) is the path loss exponent between the SN and the IRS, \(\boldsymbol{h}_{LoS}^{i,R}\) is the LoS component, and \(\boldsymbol{h}_{t,NLoS}^{i,R}\) is the NLoS component, whose entries are modeled as complex Gaussian variables with mean 0 and variance 1. Here, \( \boldsymbol{h}_{LoS}^{i,R}=[1,\cdots ,e^{-j\frac{2\pi d}{\lambda }(M-1)\sin \theta _{i,R}\cos \zeta _{i,R}}]^{H} \otimes [1,\cdots ,e^{-j\frac{2\pi d}{\lambda }(M-1)\sin \theta _{i,R}\sin \zeta _{i,R}}]^{H}\in \mathbb {C}^{M^{2}\times 1}\), where d is the spacing between adjacent IRS elements, \(\lambda \) is the carrier wavelength, and \(\theta _{i,R}\) and \(\zeta _{i,R}\) represent the vertical and horizontal angles of departure (AoDs) from SN i to the IRS, respectively. In addition, the geometric relationships are \(\sin \theta _{i,R}=\frac{H_{R}}{\left\| q_{R}-q_{i}\right\| }\), \(\sin \zeta _{i,R}=\frac{\left| x_{i}-x_{R}\right| }{\left\| q_{R}-q_{i}\right\| }\) and \(\cos \zeta _{i,R}=\frac{\left| y_{i}-y_{R}\right| }{\left\| q_{R}-q_{i}\right\| }\).

The channel between the UAV and the IRS is dominated by the LoS link. Accordingly, the channel gain at slot t is expressed as \(\boldsymbol{h}_{t}^{R,U}=\sqrt{\frac{\beta _{0}}{\left( d_{t}^{R,U}\right) ^{2}}}\boldsymbol{h}_{t,LoS}^{R,U}\in \mathbb {C}^{M^{2}\times 1}\), where \(\boldsymbol{h}_{t,LoS}^{R,U}\) is the LoS component from the IRS to the UAV. The IRS phase shift matrix at slot t is defined as \(\varTheta _{t}=\textrm{diag}\left( e^{j\theta _{t}^{1}},\ldots ,e^{j\theta _{t}^{M^{2}}}\right) \), where \(\theta _{t}^{m}\in \left[ 0,2\pi \right) \) is the phase shift of the m-th element. Therefore, the signal-to-noise ratio (SNR) can be computed as \(\eta _{t}^{i,U}=\frac{P\left| \left( \boldsymbol{h}_{t}^{R,U}\right) ^{H}\varTheta _{t}\boldsymbol{h}_{t}^{i,R}+h_{t}^{i,U}\right| ^{2}}{\sigma ^{2}}\), where P is the SN’s transmit power and \(\sigma ^{2}\) is the noise power. If the SNR is less than a threshold \(\eta _{th}\), i.e., \(\eta _{t}^{i,U}<\eta _{th}\), the UAV cannot decode the received signal successfully.
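The SNR expression above can be evaluated directly once the channel vectors are known. The following is a minimal sketch in plain Python, not the authors' implementation: the function names `received_snr` and `can_decode`, the two-element IRS, and all numerical channel values are illustrative assumptions.

```python
import cmath


def received_snr(h_ru, theta, h_ir, h_direct, tx_power, noise_power):
    """SNR = P * |h_ru^H diag(e^{j*theta}) h_ir + h_direct|^2 / sigma^2."""
    # Cascaded SN -> IRS -> UAV link; the Hermitian transpose conjugates h_ru.
    cascaded = sum(
        hr.conjugate() * cmath.exp(1j * th) * hi
        for hr, th, hi in zip(h_ru, theta, h_ir)
    )
    return tx_power * abs(cascaded + h_direct) ** 2 / noise_power


def can_decode(snr, snr_threshold):
    """The UAV decodes the packet only if the SNR reaches the threshold."""
    return snr >= snr_threshold


# Toy two-element IRS; every value below is made up for illustration.
h_ru = [1e-3 + 0j, 1e-3j]          # IRS -> UAV channel entries
h_ir = [2e-3 + 0j, 2e-3 + 0j]      # SN -> IRS channel entries
theta = [0.0, cmath.pi / 2]        # phase shifts that happen to align both paths
h_direct = 1e-6 + 0j               # direct SN -> UAV channel

snr = received_snr(h_ru, theta, h_ir, h_direct, tx_power=0.1, noise_power=1e-13)
print(snr, can_decode(snr, snr_threshold=0.77))
```

In this toy setting both reflected paths add in phase with the direct path, so the combined amplitude is the sum of the three path amplitudes.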

2.2 Queuing Model

Each SN samples the environmental information either periodically or randomly, referred to as fixed sampling and random sampling, respectively. The sensed information is packaged into an update packet of \(\omega \) bits with a timestamp [3]. Then, the packet is stored in the buffer of the SN and waits for collection by the UAV. Let \(g_{t}^{i}\in \left\{ 0,1\right\} \) denote the sampling action of SN i at slot t. Specifically, \(g_{t}^{i}=1\) denotes that SN i generates an update packet at slot t, and \(g_{t}^{i}=0\) otherwise. Once an update packet arrives at SN i at the beginning of slot t, its lifetime is recorded and updated as

$$\begin{aligned} U_{t}^{i}={\left\{ \begin{array}{ll} 0, &{} \ g_{t}^{i}=1\\ U_{t-1}^{i}+1, &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
(2)

Let \(c_{t}^{i}\in \left\{ 0,1\right\} \) be the binary SN scheduling variable, where \(c_{t}^{i}=1\) indicates that SN i is associated with the UAV and ready to send one update packet at slot t, and \(c_{t}^{i}=0\) otherwise. To fully exploit the IRS, it is assumed that the UAV is associated with at most one SN in each time slot. If the update packet is successfully delivered to the UAV with the aid of the IRS, we say that SN i is served at slot t. Accordingly, the service state of SN i is set as \(z_{t}^{i}=1\). If the transmission fails or no transmission takes place, the service state is set as \(z_{t}^{i}=0\). After a successful transmission, the AoI of this SN is updated according to the lifetime of the delivered update packet; otherwise, the AoI increases by one after a time slot. At the beginning of each slot t, the AoI is updated as

$$\begin{aligned} A_{t}^{i}={\left\{ \begin{array}{ll} U_{t-1}^{i}+1, &{} z_{t-1}^{i}=1\\ A_{t-1}^{i}+1, &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
(3)

The average AoI of all SNs at time slot t is given by \(\overline{A_{t}}=\frac{1}{I}\sum _{i=1}^{I}\mathop {A_{t}^{i}}\).
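To make the timing of Eqs. (2) and (3) concrete, the sketch below steps the packet lifetime and the AoI of a single SN through a few slots. It is a minimal illustration under stated assumptions: the helper names, the initial values `aoi0` and `lifetime0`, and the sampling and service sequences are hypothetical and not taken from the paper.

```python
def step_lifetime(prev_lifetime, generated):
    """Eq. (2): reset to 0 when a new packet is sampled, else age by one slot."""
    return 0 if generated else prev_lifetime + 1


def step_aoi(prev_aoi, prev_lifetime, served_prev):
    """Eq. (3): after a successful delivery in the previous slot, the AoI jumps
    to the delivered packet's lifetime plus one; otherwise it grows by one."""
    return prev_lifetime + 1 if served_prev else prev_aoi + 1


def simulate_aoi(g, z, aoi0=0, lifetime0=0):
    """Return the AoI trace for slots 1..T given sampling g[t] and service z[t]."""
    aoi, lifetime, served_prev = aoi0, lifetime0, 0
    trace = []
    for g_t, z_t in zip(g, z):
        aoi = step_aoi(aoi, lifetime, served_prev)   # uses U_{t-1} and z_{t-1}
        lifetime = step_lifetime(lifetime, g_t)      # uses g_t and U_{t-1}
        trace.append(aoi)
        served_prev = z_t
    return trace


def average_aoi(aoi_per_sn):
    """Average AoI over all SNs in one slot."""
    return sum(aoi_per_sn) / len(aoi_per_sn)


# One packet is sampled in slot 1 and delivered in slot 4 (assumed sequences).
trace = simulate_aoi(g=[1, 0, 0, 0, 0], z=[0, 0, 0, 1, 0], aoi0=5)
print(trace)
```

The AoI keeps growing until the slot after the successful delivery, at which point it drops to the age of the delivered packet rather than to zero.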

2.3 Problem Description

The UAV consumes energy on flight and hovering. The energy consumption for receiving and decoding the update packets is relatively small and can be omitted. The UAV hovers and collects data during the transmission interval \(T_{s}\) and then flies to the next location during the remaining time \(T_{ts}-T_{s}\). If no data needs to be collected, the UAV flies throughout the entire time slot. Therefore, the energy consumption at slot t can be expressed as

$$\begin{aligned} e_{t}^{co}={\left\{ \begin{array}{ll} P_{t}^{f}T_{ts}, &{} \sum _{i=1}^{I}c_{t}^{i}=0\\ P_{ho}T_{s}+P_{t}^{f}\left( T_{ts}-T_{s}\right) , &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$
(4)

where \(P_{t}^{f}\) is the UAV flight power as a function of flight speed \(v_{t}^{U}\) and \(P_{ho}\) is the hovering power, which can be obtained from Eq. (11) in [5]. Then, the remaining energy of the UAV at slot t can be computed as \(e_{t}^{re}=e_{t-1}^{re}-e_{t}^{co}\).
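As a minimal sketch of Eq. (4) and the remaining-energy recursion, the snippet below computes the per-slot consumption for a slot with and without a scheduled SN. The flight and hovering powers used here (150 W and 170 W) are placeholder values chosen only for illustration; in the paper they follow from the power model in [5]. The slot lengths and initial energy are those listed in the simulation section.

```python
def slot_energy(flight_power, hover_power, slot_len, tx_len, scheduled):
    """Eq. (4): fly for the whole slot if no SN is scheduled; otherwise hover
    for the transmission interval and fly for the rest of the slot."""
    if not scheduled:
        return flight_power * slot_len
    return hover_power * tx_len + flight_power * (slot_len - tx_len)


def remaining_energy(prev_remaining, consumed):
    """e_t^re = e_{t-1}^re - e_t^co."""
    return prev_remaining - consumed


# Placeholder powers in watts (assumptions, not values from the paper).
P_FLIGHT, P_HOVER = 150.0, 170.0
T_TS, T_S = 1.0, 0.5          # slot length and transmission interval in seconds
E_MAX = 1.2e5                 # initial energy in joules

e_idle = slot_energy(P_FLIGHT, P_HOVER, T_TS, T_S, scheduled=False)
e_busy = slot_energy(P_FLIGHT, P_HOVER, T_TS, T_S, scheduled=True)
e_left = remaining_energy(E_MAX, e_busy)
print(e_idle, e_busy, e_left)
```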

The objective is to minimize the weighted sum of the average AoI of all SNs and the UAV’s energy consumption by jointly optimizing the UAV flight trajectory \(\boldsymbol{Q}=\left[ q_{t}^{U},\forall t\in \mathcal {T}\right] \), the SN scheduling \(\boldsymbol{C}=\left[ c_{t}^{i},\forall i\in \mathcal {I},\forall t\in \mathcal {T}\right] \), and the IRS phase shift matrix \(\boldsymbol{\Phi }=\left[ \varTheta _{t},\forall t\in \mathcal {T}\right] \). The optimization problem can be formulated as

$$\begin{aligned} \min _{\boldsymbol{Q},\boldsymbol{C},\boldsymbol{\Phi }}\ \ &\frac{1}{T}\sum _{t=1}^{T}\left( \overline{A_{t}}+\delta e_{t}^{co}\right) \\ \text {s.t.}\ \ &C1:\sum _{i=1}^{I}c_{t}^{i}\le 1,\ c_{t}^{i}\in \left\{ 0,1\right\} ,\ \forall t\in \mathcal {T},\\ &C2:0\le \theta _{t}^{m}<2\pi ,\ \forall m\in \mathcal {M},\\ &C3:0\le x_{t}^{U}\le x_{max}^{U},\ 0\le y_{t}^{U}\le y_{max}^{U},\ \forall t\in \mathcal {T},\\ &C4:0<\left\| q_{t}^{U}-q_{t-1}^{U}\right\| < v_{max}^{U}T_{ts},\ \forall t\in \mathcal {T}, \end{aligned}$$
(5)

where \(v_{max}^{U}\) is the maximum flying speed of the UAV and \(\delta \) is the relative importance factor. The above mixed-integer non-convex problem is difficult to solve directly. Hence, we propose a TD3-based algorithm for UAV-enabled data collection, which can make decisions quickly even when the scale of the problem is very large.

3 TD3-Based UAV Data Collection Method

In this section, a TD3 algorithm is proposed for the UAV-enabled data collection to find the optimal UAV trajectory and SN scheduling policy efficiently. During each packet transmission, the IRS phase shift matrix is optimized based on the perfectly estimated channel state to maximize the received SNR at the UAV.

3.1 Optimization of IRS Phase Shift Matrix

Given the SN scheduling and UAV’s location, the received SNR is maximized by optimizing the phase shifts of IRS, which is equivalent to the following problem:

$$\begin{aligned} \max _{\varTheta _{t}}\ \ &\left| \left( \boldsymbol{h}_{t}^{R,U}\right) ^{H}\varTheta _{t}\boldsymbol{h}_{t}^{i,R}+h_{t}^{i,U}\right| ^{2}\\ \text {s.t.}\ \ &0\le \theta _{t}^{m}< 2\pi ,\ \forall m\in \mathcal {M}. \end{aligned}$$
(6)

From [9], the optimal phase shifts of IRS can be obtained by aligning the phases of the direct and indirect links between the UAV and the associated SN. In particular, when scheduling SN i, the optimal phase shift of the m-th element of IRS can be expressed as \(\theta _{t}^{m,*}=\phi _{t}^{i,U}-\left( \omega _{t,m}^{i,R}+\omega _{t,m}^{R,U}\right) ,\forall m\in \mathcal {M}\), where \(\phi _{t}^{i,U}\), \(\omega _{t,m}^{i,R}\) and \(\omega _{t,m}^{R,U}\) are the phases of the direct SN-UAV link, and the SN-IRS and IRS-UAV links via element m, respectively.
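The phase-alignment rule can be checked numerically: with aligned phases, all reflected paths add coherently with the direct path, so the combined gain reaches the triangle-inequality bound \(\left( |h_{t}^{i,U}|+\sum _{m}|h_{t,m}^{R,U}||h_{t,m}^{i,R}|\right) ^{2}\). The sketch below is illustrative only. It assumes that \(\omega _{t,m}^{R,U}\) denotes the phase of the m-th entry of \(\left( \boldsymbol{h}_{t}^{R,U}\right) ^{H}\), i.e., of the conjugated entry, and it draws random toy channels instead of using the channel model of Sect. 2.1.

```python
import cmath
import math
import random


def aligned_phases(h_ru, h_ir, h_direct):
    """theta_m = phase(direct) - (phase(h_ir[m]) + phase(conj(h_ru[m]))),
    wrapped into [0, 2*pi)."""
    phi = cmath.phase(h_direct)
    thetas = []
    for hr, hi in zip(h_ru, h_ir):
        omega = cmath.phase(hi) + cmath.phase(hr.conjugate())
        thetas.append((phi - omega) % (2 * math.pi))
    return thetas


def combined_gain(h_ru, theta, h_ir, h_direct):
    """|h_ru^H diag(e^{j*theta}) h_ir + h_direct|^2."""
    cascaded = sum(
        hr.conjugate() * cmath.exp(1j * th) * hi
        for hr, th, hi in zip(h_ru, theta, h_ir)
    )
    return abs(cascaded + h_direct) ** 2


rng = random.Random(0)          # fixed seed so the toy example is repeatable
num_elements = 16               # a hypothetical 4 x 4 IRS
h_ru = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(num_elements)]
h_ir = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(num_elements)]
h_direct = complex(rng.gauss(0, 1), rng.gauss(0, 1))

theta_opt = aligned_phases(h_ru, h_ir, h_direct)
theta_rand = [rng.uniform(0, 2 * math.pi) for _ in range(num_elements)]

gain_opt = combined_gain(h_ru, theta_opt, h_ir, h_direct)
gain_rand = combined_gain(h_ru, theta_rand, h_ir, h_direct)
bound = (abs(h_direct) + sum(abs(a) * abs(b) for a, b in zip(h_ru, h_ir))) ** 2
print(gain_opt, gain_rand, bound)
```

Because each element's phase is set independently in closed form, the IRS configuration costs only one pass over the \(M^{2}\) elements per scheduled transmission.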

3.2 TD3 Algorithm Design

MDP Problem Formulation: The optimization problem can be modeled as a finite MDP. In the sequel, we define the state space, action space and reward function, respectively.

1) State space: The system state at slot t is defined as \(s_{t}=\left[ q_{t}^{U},e_{t}^{co},\boldsymbol{A}_{t},\boldsymbol{U}_{t}\right] \), where \(\boldsymbol{A}_{t}=\left[ A_{t}^{i},\forall i\in \mathcal {I}\right] \) and \(\boldsymbol{U}_{t}=\left[ U_{t}^{i},\forall i\in \mathcal {I}\right] \).

2) Action space: The system action at slot t is defined as \(a_{t}=\left[ \mu _{t}^{U},d_{t}^{U},\boldsymbol{C}_{t}\right] \), which includes the UAV’s flight angle \(\mu _{t}^{U}\in \left( 0,2\pi \right] \), the flight distance \(d_{t}^{U}\in \left[ 0,d_{i,max}\right] \), and the SN scheduling decision \(\boldsymbol{C}_{t}=\left[ c_{t}^{i},\forall i\in \mathcal {I}\right] \). Therefore, the horizontal position of the UAV at slot \(t+1\) is updated as \(q_{t+1}^{U}=\left[ x_{t}^{U}+d_{t}^{U}\cos \mu _{t}^{U},y_{t}^{U}+d_{t}^{U}\sin \mu _{t}^{U}\right] \).

3) Reward function: Given the state \(s_{t}\) and action \(a_{t}\), the reward function at time slot t can be defined as \(r_{t}(s_{t},a_{t})=-\left( \overline{A_{t}}+\delta e_{t}^{co}\right) +p_{t}\), where \(p_{t}\) is the penalty at slot t that gives punishment for an invalid action. For example, if the current action \(a_{t}\) causes the UAV to fly out of the target area, we set \(p_{t}<0\) and otherwise \(p_{t}=0\).
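The position update and the reward defined in the list above can be sketched as follows. This is a minimal illustration, not the authors' environment code: the function names are hypothetical, the penalty value of -10 is an assumed placeholder (the paper only requires \(p_{t}<0\) for an invalid action), and the area size is taken from the simulation section.

```python
import math


def next_position(pos, angle, dist):
    """q_{t+1} = [x_t + d*cos(mu), y_t + d*sin(mu)]."""
    x, y = pos
    return (x + dist * math.cos(angle), y + dist * math.sin(angle))


def in_area(pos, x_max, y_max):
    """Constraint C3: the UAV must stay inside the rectangular target area."""
    x, y = pos
    return 0.0 <= x <= x_max and 0.0 <= y <= y_max


def reward(avg_aoi, energy, delta, penalty):
    """r_t = -(average AoI + delta * consumed energy) + p_t."""
    return -(avg_aoi + delta * energy) + penalty


X_MAX, Y_MAX = 300.0, 400.0     # target area in metres
OUT_OF_AREA_PENALTY = -10.0     # assumed value; the paper only states p_t < 0

pos = next_position((100.0, 100.0), angle=math.pi / 2, dist=40.0)
penalty = 0.0 if in_area(pos, X_MAX, Y_MAX) else OUT_OF_AREA_PENALTY
r = reward(avg_aoi=5.0, energy=160.0, delta=0.001, penalty=penalty)
print(pos, penalty, r)
```

A move that would leave the area, for example from (290, 100) by 40 m at angle 0, fails the `in_area` check and therefore receives the negative penalty.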

The objective is to find the optimal policy \(\pi ^{*}\) that maximizes the long-term return function \(C_{\pi }=\mathbb {E}_{\pi }\left[ \sum _{t=1}^{T}\left( \gamma \right) ^{t-1}r_{t}(s_{t},a_{t})|s_{1}\right] \), where \(\mathbb {E}_{\pi }\) is the expectation under policy \(\pi \), \(\gamma \in \left[ 0,1\right] \) is the discount factor, and \(s_{1}\) is the initial state. Since the reward is the negative of the weighted cost, maximizing the return is equivalent to minimizing the discounted weighted sum of the average AoI and the energy consumption.
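For a single sampled episode, the discounted sum inside the expectation can be accumulated backwards in one pass. The snippet below is a small sketch; the reward sequence and the discount factor are arbitrary illustrative values.

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=1}^{T} gamma^(t-1) * r_t by backward accumulation."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Arbitrary example: three slots with increasingly negative rewards.
episode_return = discounted_return([-1.0, -2.0, -3.0], gamma=0.5)
print(episode_return)  # -1 - 0.5*2 - 0.25*3 = -2.75
```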

TD3 Algorithm: The TD3 algorithm [7] is based on an Actor-Critic framework consisting of deep neural networks (DNNs). It has one Actor network that outputs the deterministic policy \(\pi _{\vartheta }\left( s\right) \) and two Critic networks that estimate the value functions \(Q_{\varphi _{l}}\left( s,a\right) \), \(l=1,2\). In addition, there are two target Critic networks with functions \(Q_{\varphi _{l}^{'}}\left( s,a\right) \) and one target Actor network with function \(\pi _{\vartheta ^{'}}\left( s\right) \). Mini-batches of samples are randomly drawn from the replay buffer to train the network parameters. The policy gradient is \(\nabla _{\vartheta }J\left( \vartheta \right) =N^{-1}\sum \nabla _{a}Q_{\varphi _{1}}\left( s,a\right) |_{a=\pi _{\vartheta }\left( s\right) }\nabla _{\vartheta }\pi _{\vartheta }\left( s\right) \), where N is the mini-batch size. The target Actor network copies the Actor network parameters periodically to stabilize the training process, and the target Critic networks are updated in the same way. The smaller Q value of the two target Critic networks is selected to form the target value: \(y=r+\gamma \underset{l=1,2}{\min }Q_{\varphi _{l}^{'}}\left( s^{'},\pi _{\vartheta ^{'}}\left( s^{'}\right) +\varepsilon \right) \), where \(\varepsilon \sim \textrm{clip}\left( \mathcal {N}\left( 0,\sigma \right) ,-c,c\right) \) denotes clipped Gaussian noise added to the target action. Taking the minimum of the two target Q values mitigates the overestimation problem. Then, the two Critic networks are trained with the loss function \(L\left( \varphi _{l}\right) =N^{-1}\sum \left( y-Q_{\varphi _{l}}\left( s,a\right) \right) ^{2}\), \(l=1,2\). The details of the TD3-based UAV data collection are shown in Algorithm 1.
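The target value and the Critic loss can be illustrated without any deep learning library. The sketch below is a simplified stand-in under several assumptions: the action is a single scalar, the target Actor and target Critics are plain Python callables rather than DNNs, terminal states are not handled, and all function names and numbers are hypothetical.

```python
import random


def clipped_noise(sigma, c, rng):
    """Draw Gaussian noise with standard deviation sigma and clip it to [-c, c]."""
    return max(-c, min(c, rng.gauss(0.0, sigma)))


def td3_target(r, gamma, next_state, target_actor, target_q1, target_q2,
               sigma, c, rng):
    """y = r + gamma * min_l Q'_l(s', pi'(s') + eps), eps = clipped noise."""
    next_action = target_actor(next_state) + clipped_noise(sigma, c, rng)
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return r + gamma * q_min


def critic_loss(targets, q_values):
    """Mean squared error between targets y and Critic estimates Q(s, a)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


# Toy stand-ins for the target networks (assumptions for illustration only).
def toy_actor(s):
    return 0.5 * s


def toy_q1(s, a):
    return s + a


def toy_q2(s, a):
    return s + a - 1.0      # deliberately the more pessimistic estimate


rng = random.Random(0)
# With sigma = 0 the noise vanishes, which makes the result easy to check.
y = td3_target(r=-1.0, gamma=0.9, next_state=2.0, target_actor=toy_actor,
               target_q1=toy_q1, target_q2=toy_q2, sigma=0.0, c=0.5, rng=rng)
loss = critic_loss([1.0, 2.0], [0.0, 0.0])
print(y, loss)
```

In the noise-free example the two toy Critics return 3 and 2, so the pessimistic value 2 is used and the target equals \(-1+0.9\times 2\).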

Algorithm 1. TD3-based UAV data collection

4 Simulation Results

We consider a \(300\,\text {m}\times 400\,\text {m}\) rectangular target area, and set the lower left corner of the area as the coordinate origin. The coordinates of the IRS are set as [0, 150, 30], and the horizontal coordinates of the three SNs are set as [10, 180], [85, 350] and [225, 50]. The random sampling process is modeled as a Poisson process. The system simulation parameters are set as follows: \(T_{ts}=1\,\text {s}\), \(T_{s}=0.5\,\text {s}\), \(H_{U}=60\,\text {m}\), \(\beta _{0}=-45\,\text {dB}\), \(\alpha _{1}=3.1\), \(\alpha _{2}=2.3\), \(\sigma ^{2}=-110\,\text {dBm}\), \(\omega =110\,\text {KB}\), \(\delta =0.001\), \(\eta _{th}=0.77\), \(v_{max}^{U}=40\,\text {m/s}\), \(e_{max}=1.2\times 10^{5}\,\text {J}\), \(e_{th}=8\times 10^{3}\,\text {J}\). Unless otherwise specified, the reinforcement learning parameters are shown in Table 1.
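Several of the parameters above are given on a logarithmic scale and must be converted before they are used in the channel and SNR expressions of Sect. 2.1. The following sketch shows the conversion and the resulting mean LoS path gain of the direct link when the UAV hovers directly above an SN. The helper names are hypothetical, and the example ignores small-scale fading.

```python
def db_to_linear(value_db):
    """Convert a power ratio from dB to linear scale."""
    return 10.0 ** (value_db / 10.0)


def dbm_to_watt(value_dbm):
    """Convert a power from dBm to watts (0 dBm = 1 mW)."""
    return 10.0 ** ((value_dbm - 30.0) / 10.0)


beta_0 = db_to_linear(-45.0)     # reference path loss at 1 m
sigma_2 = dbm_to_watt(-110.0)    # noise power in watts

# Mean LoS path gain beta_0 / d^alpha_1 with the UAV directly above the SN,
# so that the distance equals the flight height H_U = 60 m.
H_U, ALPHA_1 = 60.0, 3.1
mean_los_gain = beta_0 / H_U ** ALPHA_1
print(beta_0, sigma_2, mean_los_gain)
```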

Table 1. Learning parameters

Figure 2 shows the convergence curves of the proposed TD3 algorithm and the proximal policy optimization (PPO) algorithm, when fixed sampling with rate 0.2 is applied and the IRS has 15\(\,\times \,\)15 elements. It is observed that the TD3 algorithm converges faster and more stably, and is more suitable for our studied problem. This is because the PPO algorithm tends to explore insufficiently and usually finds a suboptimal policy.

Fig. 2. Convergence comparison between TD3 and PPO algorithms.

In Fig. 3(a) and Fig. 3(b), we illustrate the average AoI performance of the proposed TD3-based method for different random sampling rates and transmission powers, respectively. As shown in Fig. 3(a), for any IRS phase control policy, a higher sampling rate leads to a smaller average AoI, since the sensing data can be collected more frequently. With optimal phase alignment at the IRS, our proposed scheme achieves the minimum AoI for any sampling rate, which indicates that the IRS-aided UAV data collection scheme can significantly improve the information freshness. In Fig. 3(b), as either the number of IRS reflecting elements or the transmission power of the SN increases, the average AoI is greatly reduced. In both subfigures, the transmission scheme without IRS leads to the highest AoI of the SNs.

Fig. 3. The average AoI performance of the proposed TD3 method

5 Conclusion

This paper investigated the efficient IRS-assisted UAV data collection problem for IoT. The IRS was deployed on a tall building to improve the channel quality between the UAV and the SNs, and the optimal IRS reflection coefficients were obtained by phase alignment. The problem was modeled as an MDP, and we proposed the TD3 algorithm in DRL to find the optimal UAV trajectory and SN scheduling policy that minimizes the average AoI. Simulation results showed that integrating an IRS into UAV data collection can effectively reduce the average AoI regardless of whether the SNs sample environmental information periodically or randomly, and that the TD3 algorithm outperforms the PPO algorithm in terms of convergence speed and stability after convergence for the problem in this paper. Moreover, the larger the IRS, the lower the SN transmission power needed to guarantee a given average AoI.