
1 Introduction

Radio spectrum is an essential and scarce resource. According to a Cisco white paper forecasting global mobile data traffic for 2017–2022, global mobile data traffic will grow sevenfold between 2017 and 2022 [1]. However, related studies have revealed that many spectrum resources are not used effectively [2, 3]. Device-to-device (D2D) communication is considered a feasible solution to the problem of spectrum scarcity, offering improved spectrum efficiency and reduced communication delays [4, 5]. Furthermore, given the limitations of traditional static spectrum allocation policies, dynamic spectrum access (DSA) techniques have also been proposed to improve spectrum efficiency [6]. In D2D communications, cellular users suffer severe interference when D2D users share the same spectrum with them, and strong interference between D2D users can also seriously degrade communication quality [4, 7, 8]. Similarly, dynamic spectrum access faces two fundamental problems, namely interference coordination among DSA users and interference suppression for primary users [9].

Previous research has proposed a number of schemes for spectrum allocation between D2D users and cellular users [10, 11]. These studies investigated the reuse of cellular user resources by D2D users under non-orthogonal spectrum allocation. In [10], the authors proposed a distributed Q-learning-based spectrum allocation scheme to maximize the throughput of D2D users while maintaining the QoS requirements of cellular users. In [11], the authors proposed a distributed DRL-based spectrum allocation scheme to address interference and resource allocation between D2D and cellular users, with the aim of maximizing system throughput. However, little existing research has considered distributed spectrum access schemes that avoid conflicts between D2D users and cellular users as well as among D2D users themselves.

In this paper, we consider the uplink of a D2D underlaying cellular network and, to address the collision problem between D2D users (DUEs) and cellular users (CUEs), we propose a DRL-based distributed dynamic spectrum access scheme. In particular, we introduce the concept of a “reusable area” [12], within which D2D users can choose the number of reusable CUEs based on the range of that area. Following DRL theory, each agent learns the optimal access policy only through imperfect spectrum sensing, without a priori system information, increasing the system throughput while avoiding collisions with other DUEs and CUEs.

2 System Model

We consider a dynamic spectrum access scenario in the uplink of a D2D underlaying cellular network. As shown in Fig. 1, the system model includes N cellular users (CUEs), denoted by \(\mathbb {N} =\left\{ 1,2,\ldots ,N\right\} \), and K pairs of D2D users, denoted by \(\mathbb {K} =\left\{ 1,2,\ldots ,K\right\} \), where each D2D pair consists of a transmitter (DTx) and a receiver (DRx). \(d_{ii}\) denotes the distance between the DTx and the DRx, \(d_{jk}\) the distance between a CUE and the BS, and \(d_{ik}\) the distance between the DTx and the BS. We assume that the system has N channels and that each CUE transmits on a unique channel, thus avoiding interference between CUEs.

Fig. 1. Uplink scenario for D2D communications underlaying cellular networks.

2.1 Channel Model

We adopt the WINNER II channel model to calculate the path loss caused by signal propagation in space [15], which is described as a distance-dependent function

$$\begin{aligned} P L\left( d, f_{c}\right) =\overline{P L}+ B\log _{10}(d[\mathrm {m}])+C \log _{10}\left( \frac{f_{c}[\mathrm {GHz}]}{5}\right) , \end{aligned}$$
(1)

where \(f_{c}\) denotes the carrier frequency, and \(\overline{P L}\), B and C denote the reference path loss at unit distance, the path loss exponent and the path loss frequency dependence, respectively. For simplicity, we assume a strong line-of-sight (LOS) path on each transmission link. Therefore, we model the channel with the Rician channel model [13], which can be expressed as

$$\begin{aligned} h=\sqrt{\frac{\kappa }{\kappa +1}\sigma } e^{j \theta }+\sqrt{\frac{1}{\kappa +1}} C N\left( 0, \sigma ^{2}\right) \end{aligned}$$
(2)

where \(\sigma ^{2}=10^{-\frac{\overline{P L}+B \cdot \log _{10}(d[\mathrm {m}])+C \cdot \log _{10}\left( \frac{f_{c}[\mathrm {GHz}]}{5}\right) }{10}}\) is determined by the path loss, \(\kappa \) denotes the Rician \(\kappa \)-factor, defined as the power ratio of the LOS component to the scattered component, \(\theta \) denotes the phase, drawn from a uniform distribution between 0 and 1, and \(CN\left( \cdot \right) \) denotes a circularly symmetric complex Gaussian random variable.
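To make the channel model concrete, the following Python sketch draws a channel coefficient following (1) and (2). The WINNER II constants, the \(\kappa \)-factor and the function names are illustrative placeholders rather than the values used in our simulations, and the LOS term is written in the standard form with power \(\kappa \sigma ^{2}/(\kappa +1)\).

```python
import numpy as np

# Illustrative sketch of Eqs. (1)-(2); PL_BAR, B, C and kappa are placeholder
# values, not the parameters used in the paper's simulations.
PL_BAR, B, C = 41.0, 22.7, 20.0   # reference loss, path loss exponent, frequency dependence

def path_loss_db(d_m, fc_ghz):
    """WINNER II path loss of Eq. (1) in dB."""
    return PL_BAR + B * np.log10(d_m) + C * np.log10(fc_ghz / 5.0)

def rician_coefficient(d_m, fc_ghz, kappa=6.0, rng=None):
    """Draw one complex channel coefficient h following Eq. (2)."""
    if rng is None:
        rng = np.random.default_rng()
    sigma2 = 10.0 ** (-path_loss_db(d_m, fc_ghz) / 10.0)      # large-scale power from path loss
    theta = rng.uniform(0.0, 1.0)                              # phase term as stated in the text
    los = np.sqrt(kappa / (kappa + 1.0) * sigma2) * np.exp(1j * theta)
    scatter = np.sqrt(1.0 / (kappa + 1.0)) * (
        rng.normal(0.0, np.sqrt(sigma2 / 2.0)) + 1j * rng.normal(0.0, np.sqrt(sigma2 / 2.0)))
    return los + scatter
```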

2.2 Uplink Signal Model

For the uplink scenario, when DUEs and CUEs transmit in the same time slot, the DUEs can cause harmful interference to the CUEs. Hence, the instantaneous signal-to-interference-plus-noise ratio (SINR) received at the BS from CUE j can be expressed as

$$\begin{aligned} S I N R_{j}=\frac{P_{c} \cdot \left| h_{j k}\right| ^{2}}{P_{d} \cdot \left| h_{i k}\right| ^{2}+B \cdot N_{0}} \end{aligned}$$
(3)

where \(P_{c}\) and \(P_{d}\) represent the transmit powers of the CUE and the DTx, respectively, \(\left| h_{j k}\right| ^{2}\) denotes the channel gain of the cellular link, and \(\left| h_{i k}\right| ^{2}\) denotes the channel gain from the \(i^{th}\) D2D transmitter to the BS, both of which can be derived from (2). B and \(N_{0}\) represent the channel bandwidth and the noise spectral density, respectively. Furthermore, we assume in this model that each channel can be used by at most one D2D pair.
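As a concrete example, the SINR in (3) can be evaluated directly from the channel coefficients above; all quantities are in linear scale and the function name and argument names are ours.

```python
# Illustrative evaluation of Eq. (3): instantaneous uplink SINR of a CUE at
# the BS when one D2D transmitter reuses its channel.
def uplink_sinr(p_c, p_d, h_jk, h_ik, bandwidth_hz, n0):
    signal = p_c * abs(h_jk) ** 2          # received power of the cellular link
    interference = p_d * abs(h_ik) ** 2    # interference from the D2D transmitter
    noise = bandwidth_hz * n0              # thermal noise power B * N0
    return signal / (interference + noise)
```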

Fig. 2. Dynamic spectrum access framework based on D2D underlaying cellular networks.

3 DRL-based Dynamic Spectrum Access Scheme

3.1 Deep Reinforcement Learning

A reinforcement learning model contains three components: the possible states of the environment, the possible actions that the agent may take according to a policy \(\pi \), and the feedback reward that the agent receives after taking an action. These three components are denoted by \(s_{t}\), \(a_{t}\), and \(r_{t+1}\), respectively. The goal of the agent is to learn an optimal policy \(\pi ^{*}\) that maximizes the cumulative discounted reward \(R_{t} = \sum _{i=0}^{\infty } \gamma ^{i}r_{t+1+i}\), where \(\gamma \in [0,1]\) is the discount factor. Q-values are updated with the following rule:

$$\begin{aligned} \begin{aligned} Q\left( s_{t}, a_{t}\right) =\,&Q\left( s_{t}, a_{t}\right) +\\&\alpha \left( r_{t+1}+\gamma \max _{a} Q\left( s_{t+1}, a\right) -Q\left( s_{t}, a_{t}\right) \right) , \end{aligned} \end{aligned}$$
(4)

where \(\alpha \in (0,1]\) is the learning rate. Furthermore, the policy \(\pi \) is updated by means of the \(\epsilon \)-greedy algorithm.
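For illustration, a minimal tabular version of the update rule (4) and the \(\epsilon \)-greedy policy can be sketched as follows; the table shape, state/action encoding and hyper-parameters are placeholders.

```python
import numpy as np

# Illustrative tabular Q-learning update (Eq. (4)) and epsilon-greedy exploration.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One step of the Q-value update in Eq. (4)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[s]))                # exploit
```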

DRL uses a deep neural network (DNN) to approximate the Q-values (DQN), i.e., \(Q(s_{t},a_{t};\boldsymbol{\theta }) \approx Q(s_{t}, a_{t})\), where \(\boldsymbol{\theta }\) denotes the network weights. In DQN, the temporal-difference (TD) error is typically used to compute the loss function,

$$\begin{aligned} Loss(\boldsymbol{\theta }) = E[(y_{t} - Q(s_{t},a_{t};\boldsymbol{\theta }))^{2}] \end{aligned}$$
(5)

where \(y_{t}\) represents the target Q-value and is defined as

$$\begin{aligned} y_{t} = r_{t+1} + \gamma \max _{a} Q\left( s_{t+1}, a; \boldsymbol{\theta }\right) \end{aligned}$$
(6)

After that, the agent can minimize the loss function by the gradient descent algorithm as follows:

$$\begin{aligned} \boldsymbol{\theta _{t+1}}=\boldsymbol{\theta _{t}}+\alpha E\left[ \left( y_{t}-Q\left( s_{t}, a_{t} ; \boldsymbol{\theta }\right) \right) \nabla Q\left( s_{t}, a_{t} ; \boldsymbol{\theta }\right) \right] \end{aligned}$$
(7)

where \(\alpha \) is the learning rate.
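A compact sketch of one DQN update implementing (5)–(7) is given below. The network architecture, the single-network target and all hyper-parameters are illustrative assumptions; in practice the mini-batch comes from an experience replay buffer, and the optimizer can be, e.g., torch.optim.Adam(q_net.parameters(), lr=1e-3).

```python
import torch
import torch.nn as nn

# Illustrative sketch of one DQN update following Eqs. (5)-(7).
class QNetwork(nn.Module):
    def __init__(self, n_channels, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_channels, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, state):
        return self.net(state)             # Q(s, .; theta) for every action

def dqn_step(q_net, optimizer, batch, gamma=0.9):
    s, a, r, s_next = batch                # mini-batch sampled from a replay buffer
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=1).values   # target y_t, Eq. (6)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_t, a_t; theta)
    loss = nn.functional.mse_loss(q_sa, y)                # loss, Eq. (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # gradient step, Eq. (7)
    return loss.item()
```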

3.2 Uplink Dynamic Spectrum Access Framework

In our proposed scheme, as shown in Fig. 2, we formulate distributed dynamic spectrum access as a deep reinforcement learning model. As described in Sect. 2, the system model consists of N channels and K D2D pairs, where each channel is occupied by one CUE. We therefore describe the channel states as Active and Inactive, denoted by 0 and 1 respectively, where Active indicates that the channel is occupied by a CUE and Inactive indicates that it is not. The detailed definitions of the state, action and reward are as follows.

State: At the beginning of each time slot, each D2D pair senses the channel states in the environment, and the sensing results may contain errors. The agent then uses the sensed results as the input to the neural network to be trained. The state space of the \(k^{th}\) D2D pair is defined as \(\textbf{S}^k (t) = [s_{1}^{k} (t), \ldots , s_{N}^{k} (t)]\), an N-dimensional vector in which \(s_{n}^{k} (t)\) denotes the sensed state (Active or Inactive) of the \(n^{th}\) channel.
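To illustrate imperfect sensing, the sketch below generates the sensed state vector of one D2D pair from the true channel occupancy. The bit-flip error model and the probability p_err are our own simplifying assumptions, not the sensing model used in the experiments.

```python
import numpy as np

# Illustrative imperfect spectrum sensing: each channel's true occupancy
# (0 = Active, 1 = Inactive) is flipped with an assumed error probability.
def sense_channels(true_occupancy, p_err=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    true_occupancy = np.asarray(true_occupancy)
    flips = rng.random(true_occupancy.shape) < p_err   # channels sensed incorrectly
    return np.where(flips, 1 - true_occupancy, true_occupancy).astype(int)
```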

Action: The agent decides whether to access a channel and, if so, which channel, based on the spectrum access policy. Hence, the action space can be defined as \(A = \left\{ 0,1,\ldots ,N\right\} \), where \(a_{t} = 0\) means that the agent does not access any channel and \(a_{t} = n\) means that the agent accesses the \(n^{th}\) channel.

Rewards: According to the situations that the agent may face after taking an action, the reward function is set as follows.

1. The D2D pair collides with a CUE. This indicates that the D2D pair has accessed the channel occupied by a CUE whose link resources cannot be reused. In Sect. 2, we mention the concept of warning signals; therefore, we assign a penalty value of \(-4\) when the agent receives a warning signal. For convenience, we define this case as \(\mathbb {C}\).

2. A collision between D2D users. This case indicates that different D2D pairs access the same channel. We set the reward value for this case to 0 and define it as \(\mathbb {D}\).

3. The D2D pair does not access any channel. We set the reward value for this case to 1 and define it as \(\mathbb {I}\).

4. The D2D pair successfully accesses a channel. The reward value for this case should be the largest. We therefore use the normalized \(\hat{SINR_{j}}\) in the reward function, described as \(1 + \log _{2}(1 + \hat{SINR_{j}})\). We define this case as \(\mathbb {S}\).

In summary, the reward function for the \(k^{th}\) D2D pair on the \(n^{th}\) channel can be described as

$$\begin{aligned} r_{t+1}^{k}= {\left\{ \begin{array}{ll} -4, &{} \text {Case } \mathbb {C} \\ 0, &{} \text {Case } \mathbb {D} \\ 1, &{} \text {Case } \mathbb {I}\\ 1 + \log _{2}(1+\hat{SINR_{j}}), &{} \text {Case } \mathbb {S}\end{array}\right. } \end{aligned}$$
(8)
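The reward rule (8) can be encoded compactly as below. The occupancy and reusability flags are assumed to be provided by the simulator, and mapping the warning signal to case \(\mathbb {C}\) through a reusability flag is an illustrative simplification.

```python
import numpy as np

# Illustrative encoding of the reward rule in Eq. (8) for one D2D pair.
def d2d_reward(action, cue_active, reusable, other_d2d_channels, sinr_hat):
    if action == 0:
        return 1.0                                    # Case I: no access
    channel = action - 1                              # action n -> channel index n-1
    if cue_active[channel] and not reusable[channel]:
        return -4.0                                   # Case C: collision with a CUE
    if channel in other_d2d_channels:
        return 0.0                                    # Case D: collision with another D2D pair
    return 1.0 + np.log2(1.0 + sinr_hat)              # Case S: successful access
```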
Fig. 3. Performance evaluation in non-orthogonal scenarios \((P=\frac{1}{2})\).

Fig. 4. Performance evaluation in non-orthogonal scenarios \((P=1)\).

4 Performance Evaluation

In this section, we evaluate the performance of the proposed algorithmic framework. Specifically, we compare it with the Myopic algorithm [14], which relies on a priori information about the system, and with DQN+ESN [13].

In our scheme, we consider a cellular cell with a radius of 100 m. The locations of the D2D pairs and the CUEs in this cell are randomly generated, and the communication distance between the transmitter and the receiver of a D2D pair is randomly drawn between 20 m and 40 m. Therefore, the location of a CUE may fall within the “reusable area” that we have defined. We express the fraction of the cell covered by the reusable area as \(P=\frac{\pi \cdot L_{th}^{2}}{\pi \cdot R^{2}}\), where the threshold distance \(L_{th}\) is determined by the value of P. The specific simulation parameters are shown in Table 1.

Table 1. Simulation parameters

In this scenario, we set the number of D2D pairs to 4 and the number of CUEs to 2. The positions of the CUEs are randomly generated, and their distances to the BS are 95.75 m and 30.36 m. We first consider the case of \(P=\frac{1}{2}\), for which the corresponding threshold distance is \(L_{th}=50\sqrt{2}\) m. According to the reusable area corresponding to this threshold distance, the D2D pairs can reuse one CUE's resources. The simulation results are shown in Fig. 3: the DQN-based methods perform significantly better than the Myopic algorithm. In particular, the Myopic algorithm selects the action with the greatest immediate reward and therefore performs well in avoiding collisions with the CUEs. However, the simulation results show that the DQN-based method, after training, also maximizes throughput while achieving collision avoidance. In this scenario, our scheme performs approximately the same as DQN+ESN. Additionally, we consider the non-orthogonal access scenario with \(P=1\), in which the reusable area covers the entire cell, so all CUEs in the cell can be reused by the D2D pairs. The simulation results, shown in Fig. 4, demonstrate that our scheme achieves the theoretical maximum success rate and higher throughput.
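As a quick check of the reusable-area relation, \(P=\frac{L_{th}^{2}}{R^{2}}\) gives \(L_{th}=R\sqrt{P}\); the short snippet below reproduces the two threshold distances used above.

```python
import numpy as np

# Threshold distance from the reusable-area fraction: P = L_th^2 / R^2.
R = 100.0                        # cell radius in metres
for P in (0.5, 1.0):
    L_th = R * np.sqrt(P)
    print(f"P = {P:g} -> L_th = {L_th:.2f} m")   # 70.71 m (= 50*sqrt(2)) and 100 m
```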

5 Conclusion

In this paper, we investigated dynamic spectrum access in the uplink of a D2D underlaying cellular network. Specifically, we proposed a distributed dynamic spectrum access scheme under imperfect spectrum sensing, which allows D2D users to learn an optimal spectrum access policy that maximizes throughput without a priori information. In addition, we introduced the concept of a reusable area, within which a D2D user can choose the number of reusable CUEs based on the coverage of that area. Simulation results show that our scheme can avoid collisions while maximizing the system throughput.