1 Introduction

Device-to-device (D2D) communication, one of the promising techniques for 5G and beyond systems, leverages proximity transmission and spectrum reuse. D2D links can therefore significantly reduce latency and improve spectrum efficiency without traversing the base station (BS) [1]. However, without a sophisticated access mechanism, cellular users (CUs) located far from the BS would suffer strong interference when D2D pairs share their resources.

Many contributions have been dedicated to resource allocation for D2D communications [2,3,4,5]. Liang et al. [2] proposed a spectrum allocation and power optimization algorithm to overcome the challenges of dynamic D2D channels. A distributed spectrum allocation framework was proposed in Li and Guo [3] by adopting the actor-critic (AC) scheme to handle a decision-making problem with large state-action spaces. Najla et al. [4] proposed a sequential bargaining game to determine the coalitions of D2D pairs mutually reusing multiple channels. Furthermore, Kai et al. [5] considered a joint downlink and uplink resource allocation scheme to maximize the sum data rate of NOMA-enabled D2D groups while guaranteeing the QoS of both the CUs and the NOMA-enabled D2D groups. Recently, deep learning approaches have been explored in wireless communications [6,7,8,9,10]. Wang et al. [7] investigated the optimal resource allocation policy in information-centric wireless networks by maximizing the spectrum efficiency with deep reinforcement learning. Relying only on local user information and observations, multi-agent reinforcement learning (MARL) based approaches have been widely applied. For instance, Vu et al. [8] proposed a distributed resource allocation algorithm that leverages MARL to cope with the dynamic environment of vehicular communication systems. The double deep Q-network (DDQN) was proposed in Van Hasselt et al. [6] to overcome overestimation by decomposing action selection and action evaluation into two DQNs. Huang et al. [9] designed a DDQN-based spectrum access (\(D^4SA\)) algorithm that allows D2D pairs to autonomously learn a policy maximizing the sum rate in an underlay cellular network. Furthermore, the authors in Ji et al. [10] combined MARL and DDQN to propose a decentralized DDQN framework for resource allocation at the users and a centralized DDQN for reconfigurable intelligent surface optimization at the BS. However, the existing deep learning based allocation methods incur excessive memory overhead and make the entire network vulnerable.

To overcome the aforementioned issue, in this work we consider a dynamic time division duplex (TDD) network where TSs are assigned to CUs orthogonally. D2D pairs act as agents that learn whether to access a TS, in an underlay manner, based on both the CUs’ positions and the communication status.Footnote 1 The challenges come from two aspects. First, the channel of D2D communications varies rapidly, which makes conventional resource allocation approaches based on alternating optimization hard to converge. Second, D2D pairs are assumed to have no information about the access behaviours of CUs. Thus, based on the deep reinforcement learning (DRL) philosophy, we adopt the double deep Q-network (DDQN) and propose a maximal throughput algorithm (MTA) for D2D spectrum access. Compared with the \(D^4SA\) algorithm proposed in Huang et al. [9], our method directly regards throughput as the objective rather than the number of accessed user links; the Q-function is linked to throughput through the well-known Shannon capacity. While \(D^4SA\) uses previous state information as inputs, our method uses only the current state and thus enjoys fewer inputs, lower computational complexity, and a smaller memory requirement. Furthermore, the received signal-to-noise ratios (SNRs) are taken as inputs to the DDQN in our method. Simulation results demonstrate that the proposed algorithm achieves a much higher throughput than state-of-the-art contributions.

The remainder of the paper is organized as follows. In Sect. 2, the system model and problem formulation are presented. In Sect. 3, the proposed DDQN MTA algorithm is investigated. In Sect. 4, simulation results and complexity analysis are described. Finally, some conclusions are given in Sect. 5.

2 System model

2.1 System description

We consider a scenario where D2D communications underlay a cellular network: I D2D pairs share the uplink spectrum resources of K CUs. In this letter, both D2D and cellular communications are assumed to follow the TDD principle, so that reciprocity holds between the uplink and downlink channel state information (CSI). Thus, the overhead of timely uploading CSI and other related information to the BS can be reduced. The CUs and D2D pairs access the network through allocated TSs within each frame in a repetitive way, and one frame is assumed to contain T TSs. The TSs allocated to CUs are assumed orthogonal to avoid co-channel interference among CUs. D2D pairs learn to share the spectrum in proper TSs while ensuring the QoS requirements of the CUs.

Let \(\alpha _{i,k}\) be the resource reuse factor of the i-th D2D pair and CU k: \(\alpha _{i,k}=1\) if the i-th D2D pair reuses the spectrum assigned to CU k, and \(\alpha _{i,k}=0\) otherwise. Let \({\mathcal {I}}=\{1,2,\ldots ,I\}\) and \({\mathcal {K}}=\{1,2,\ldots ,K\}\) denote the set of D2D pairs and the set of CUs, respectively. For clarity of exposition, at most one D2D pair is allowed to access each shared TS.Footnote 2 The signal-to-interference-plus-noise ratio (SINR) of CU k can be expressed as

$$\begin{aligned} \gamma _k^C = \frac{P^CG_{B,k}^C}{\sigma ^2+ \sum _{i\in {\mathcal {I}}}\alpha _{i,k}P^DG_{B,i}^D}, \end{aligned}$$
(1)

where \(P^C\) and \(P^D\) are the transmit powers of the CU and the D2D transmitter (DT), respectively, \(G_{B,k}^C\) and \(G_{B,i}^D\) are the channel power gains from CU k to the BS and from DT i to the BS, respectively, and \(\sigma ^2\) is the variance of the additive white Gaussian noise. On the other hand, the SINR at the i-th D2D receiver (DR) sharing the spectrum can be described as

$$\begin{aligned} \gamma _i^D = \frac{P^DG_i^D}{\sigma ^2+ \sum _{k\in {\mathcal {K}}}\alpha _{i,k}P^CG_{k,i}^C}, \end{aligned}$$
(2)

where \(G_i^D\) and \(G_{k,i}^C\) are the channel power gains from DT i to DR i and from CU k to DR i, respectively. To ensure the QoS requirement of the CUs, the SINR of CU k, \(\gamma _k^C\), should be kept above a predefined threshold \(\gamma _{th}\), i.e., \(\frac{P^CG_{B,k}^C}{\sigma ^2+P^DG_{B,i}^D} \ge \gamma _{th}\). Since \(G_{B,k}^C \propto d_k^{-n}\) (e.g., \(n=3\) in a fully scattered environment), where \(d_k\) stands for the distance from CU k to the BS, the following relationship can be obtained

$$\begin{aligned} d_k \le \left( \frac{P^C}{\gamma _{th}\left( \sigma ^2+P^DG_{B,i}^D\right) }\right) ^{1/n} \triangleq d_{th}, \end{aligned}$$
(3)

where \(d_{th}\) is the threshold that guarantees the QoS of a CU sharing its resources, provided the CU is located close enough to the BS (\(d_k \le d_{th}\)). Only when a CU lies in the "sharable area", a circular area within \(d_{th}\) of the BS, are D2D pairs admitted to access the spectrum by sharing this CU’s allocated TSs. Otherwise, if the CU is far from the BS, no D2D pair is allowed to reuse its spectrum resource. Then, the capacities of CU k and of the i-th D2D pair sharing CU k at TS t can be expressed as

$$\begin{aligned} C_k^C[t]&= W[k]\log _2(1+\gamma _k^C[t]), \end{aligned}$$
(4)
$$\begin{aligned} C_i^D[t]&= W[k]\log _2(1+\gamma _i^D[t]), \end{aligned}$$
(5)

respectively, where W[k] is the allocated bandwidth of the k-th CU. The overall capacity of the underlay D2D network can be described as

$$\begin{aligned} C_{tot}=\frac{1}{T}\sum _{t=1}^{T}\left( \sum _{k=1}^{K}C_k^C[t] + \sum _{i=1}^{I}C_i^D[t]\right) . \end{aligned}$$
(6)
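To make the system-model quantities concrete, the following minimal Python sketch evaluates (1), (2), (4), (5), and (6) for a given binary reuse pattern. It is illustrative only: the array shapes and names (`alpha`, `G_Bk_C`, etc.) are our own conventions and not part of the original formulation.

```python
import numpy as np

def total_capacity(alpha, P_C, P_D, G_Bk_C, G_Bi_D, G_i_D, G_ki_C, W, sigma2):
    """Evaluate C_tot in (6) over one frame of T TSs (illustrative sketch).

    alpha  : (T, I, K) binary reuse indicators alpha_{i,k} per TS
    G_Bk_C : (T, K) channel power gains CU k -> BS
    G_Bi_D : (T, I) channel power gains DT i -> BS
    G_i_D  : (T, I) channel power gains DT i -> DR i
    G_ki_C : (T, K, I) channel power gains CU k -> DR i
    W      : (K,) bandwidth allocated to each CU
    """
    T, I, K = alpha.shape
    C_tot = 0.0
    for t in range(T):
        # SINR of CU k, Eq. (1): interference from any DT reusing CU k's TS
        interf_C = (alpha[t] * P_D * G_Bi_D[t][:, None]).sum(axis=0)       # (K,)
        gamma_C = P_C * G_Bk_C[t] / (sigma2 + interf_C)
        # SINR at DR i, Eq. (2): interference from the CU whose TS is reused
        interf_D = (alpha[t] * P_C * G_ki_C[t].T).sum(axis=1)              # (I,)
        gamma_D = P_D * G_i_D[t] / (sigma2 + interf_D)
        # Rates (4) and (5); a D2D pair contributes only on the TS it reuses
        C_tot += (W * np.log2(1.0 + gamma_C)).sum()
        shares = alpha[t].max(axis=1)                                      # (I,)
        C_tot += (shares * W[alpha[t].argmax(axis=1)]
                  * np.log2(1.0 + gamma_D)).sum()
    return C_tot / T
```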

2.2 Problem formulation

Our target is to maximize the overall system capacity \(C_{tot}\) in (6) by optimizing the reuse vector \(\varvec{\alpha }=[\alpha _{1,1},\ldots ,\alpha _{1,K},\ldots ,\alpha _{I,1},\ldots ,\alpha _{I,K}]^T\) as follows.

$$\begin{aligned} \max _{\varvec{\alpha }}\,&C_{tot} \end{aligned}$$
(7a)
$$\begin{aligned} \text {subject to }&\gamma _k^C \ge \gamma _{th}, \forall k\in {\mathcal {K}} \end{aligned}$$
(7b)
$$\begin{aligned}&\alpha _{i,k}\in \{0,1\}, \forall i\in {\mathcal {I}}, k\in {\mathcal {K}} \end{aligned}$$
(7c)
$$\begin{aligned}&\sum _{k\in {\mathcal {K}}}\alpha _{i,k} \le 1, \forall i\in {\mathcal {I}}, \end{aligned}$$
(7d)
$$\begin{aligned}&\sum _{i\in {\mathcal {I}}}\alpha _{i,k} \le 1, \forall k\in {\mathcal {K}}, \end{aligned}$$
(7e)

where constraint (7d) ensures that each D2D pair reuses at most one spectrum resource and constraint (7e) ensures that at most one D2D pair transmits in each TS. Constraints (7c)–(7e) make the above optimization problem non-convex. Solving this non-convex problem requires an exhaustive search, which is impractical for a large number of D2D pairs and/or CUs. This motivates us to leverage the DDQN framework for dynamic spectrum access instead, which remains applicable when the number of D2D pairs and/or CUs, and hence the state and action dimensions, is large.
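For small instances, problem (7) can be solved by direct enumeration, which also makes its combinatorial nature explicit. The sketch below is illustrative only: `capacity_fn` and `sinr_ok_fn` are assumed callbacks (e.g., built on the `total_capacity` helper sketched in Sect. 2.1) that evaluate (6) and check (7b), respectively.

```python
from itertools import product
import numpy as np

def exhaustive_search(I, K, capacity_fn, sinr_ok_fn):
    """Brute-force solver for (7): each D2D pair picks one CU's TS or stays idle."""
    best_alpha, best_cap = None, -np.inf
    # choice[i] in {0, ..., K}: 0 = idle, k = reuse CU k's TS -> (K+1)^I candidates
    for choice in product(range(K + 1), repeat=I):
        picked = [c for c in choice if c > 0]
        if len(picked) != len(set(picked)):   # (7e): at most one D2D pair per TS
            continue
        alpha = np.zeros((I, K), dtype=int)   # (7c): binary by construction
        for i, c in enumerate(choice):
            if c > 0:
                alpha[i, c - 1] = 1           # (7d): at most one CU per D2D pair
        if not sinr_ok_fn(alpha):             # (7b): CU QoS constraint
            continue
        cap = capacity_fn(alpha)
        if cap > best_cap:
            best_alpha, best_cap = alpha, cap
    return best_alpha, best_cap
```

The number of candidates grows as \((K+1)^I\), which is what renders exhaustive search impractical for large networks.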

3 Proposed algorithm

3.1 Reinforcement learning and DQN

Fig. 1
figure 1

The interaction of the DDQN at the centralized agent

Reinforcement learning (RL) can be modeled as a Markov decision process [10], consisting of an environment state space \({\mathcal {S}}\), an action space \({\mathcal {A}}\), and a reward \({\mathcal {R}}\) evaluated for each state-action pair. At each training step p, the agent observes the state \(s_p\) and responds with an action \(a_p\) according to a certain policy \(\pi\). The agent then receives a corresponding reward \(r_p\) and transfers to the next state \(s_{p+1}\), which is determined by the current state \(s_p\) and action \(a_p\). This process can be denoted by a transfer tuple \(e_p = (s_p, a_p, r_p, s_{p+1})\). Figure 1 shows the interaction process of the centralized DDQN agent at the BS. This centralized agent decides whether to access the current TS and which DT should transmit. If a DT is selected to transmit, it receives feedback after the transmission and then informs the agent of the result. At training step p, the RL agent aims to learn an optimal policy that maximizes the cumulative discounted reward, which is expressed as

$$\begin{aligned} R_p = \sum _{\tau =0}^{\infty }\beta ^{\tau }r_{p+\tau }, \end{aligned}$$
(8)

where \(0\le \beta \le 1\) is the discount factor indicating the impact of future rewards. The expected reward of a state-action pair (s, a), a.k.a. the action-value function, is defined as

$$\begin{aligned} q^{\pi }(s,a)=E_{\pi }[R_p|s_p = s, a_p=a], \end{aligned}$$
(9)

where the policy \(\pi\) represents a mapping from the state space \({\mathcal {S}}\) to the probability that each action in \({\mathcal {A}}\) is selected. Then, the optimal policy \(\pi ^*\) can be described as \(\pi ^* = \arg \max _{\pi }\{q^{\pi }(s,a)\}\).

Q-learning, one of the typical RL algorithms, maintains a Q-table of the action-value function to find the optimal policy \(\pi ^*\), i.e., \(Q^*(s,a)=\max _{\pi }q^{\pi }(s,a)\). However, when the cardinality of the state set \({\mathcal {S}}\) and/or the action set \({\mathcal {A}}\) is large, the resources required for Q-learning to converge become prohibitive for implementation. Alternatively, the deep Q-network (DQN) introduces a Q neural network (QNN) to approximate the Q-value instead of maintaining a Q-table. That is, given an input state \(s_p\), the QNN outputs an estimated Q-value for all possible actions, i.e., \(q(s_p, a;{\textbf{W}})\approx Q^*(s_p,a), \forall a\in {\mathcal {A}}\), where \({\textbf{W}}\) stands for the parameters of the QNN. The training data set \({\mathcal {D}}=[{\textbf{e}}_1, {\textbf{e}}_2, \ldots ]\) for the QNN is built from the agent’s experience \({\textbf{e}}_p = (s_p, a_p, r_p, s_{p+1})\) at each training step p. A training batch of experiences is then sampled randomly from the data set \({\mathcal {D}}\). The training minimizes the loss function, defined as the mismatch between the target Q-value and the realistic Q-value,

$$\begin{aligned} Loss({\textbf{W}})&=E[(q_{target} - q(s,a;{\textbf{W}}))^2], \end{aligned}$$
(10)
$$\begin{aligned} \text {where }q_{target}&= r_p + \beta \max _{a'}q(s_{p+1},a';{\textbf{W}}'). \end{aligned}$$
(11)

Here \(q_{target}\) is the target Q-value of the target network with weights \({\textbf{W}}'\). Moreover, the weights \({\textbf{W}}\) (\({\textbf{W}}'\)) can be iteratively updated by the gradient descent method [11] as

$$\begin{aligned} {\textbf{W}} = {\textbf{W}} + \delta E[(q_{target} - q(s,a;{\textbf{W}}))\nabla q(s,a;{\textbf{W}})], \end{aligned}$$
(12)

where \(\delta\) stands for the updating constant (learning rate) and \(\nabla q(s,a;{\textbf{W}})\) denotes the gradient of the Q-function with respect to the network weights \({\textbf{W}}\).
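For clarity, the loss (10), the target (11), and the update (12) can be written compactly in code. The following PyTorch sketch is a generic DQN update step under our own naming conventions; it is not a verbatim reproduction of the training routine in this letter.

```python
import torch
import torch.nn as nn

def dqn_update(q_net, q_target_net, optimizer, batch, beta=0.95):
    """One gradient step on the loss (10) with the DQN target (11).

    batch: tensors (s, a, r, s_next) sampled uniformly from the replay set D,
           with a stored as integer (long) action indices.
    """
    s, a, r, s_next = batch
    # Realistic Q-value q(s, a; W) of the actions that were actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # DQN target (11): max over a' evaluated with the target weights W'
        q_target = r + beta * q_target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, q_target)   # Eq. (10)
    optimizer.zero_grad()
    loss.backward()                                  # gradient of (10) w.r.t. W
    optimizer.step()                                 # weight update as in (12)
    return loss.item()
```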

3.2 DDQN algorithm

It is well known that DQN can achieve a near-optimal solution in some scenarios, but it sometimes suffers from overestimation, i.e., the target Q-value may be higher than the true optimal action-value. To overcome this issue, DDQN was proposed by decomposing the original deep Q-network into an action selection network and an action evaluation network [6]; that is, DDQN uses a target Q-network for action evaluation and an evaluation Q-network for action selection. The target Q-value of DDQN is then obtained as

$$\begin{aligned} q_{target} = r_p + \beta q(s_{p+1}, \arg \max _{a'\in {\mathcal {A}}}q(s_{p+1},a';{\textbf{W}});{\textbf{W}}'). \end{aligned}$$
(13)
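In code, the only difference with respect to (11) is that the evaluation network (weights \({\textbf{W}}\)) selects the action while the target network (weights \({\textbf{W}}'\)) evaluates it. A minimal sketch, with assumed function and variable names:

```python
import torch

def ddqn_target(q_net, q_target_net, r, s_next, beta=0.95):
    """DDQN target (13): action selection with W, action evaluation with W'."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)          # arg max with W
        q_eval = q_target_net(s_next).gather(1, a_star).squeeze(1)  # evaluate with W'
        return r + beta * q_eval
```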

Here we adopt DDQN to design our proposed algorithm for D2D underlay networks. Rather than the distance-based information used in \(D^4SA\) [9], CSI-based information is adopted, which enhances the robustness of the DDQN model in an underlay D2D network [10]. The definitions of “state”, “action”, and “reward” of our proposed DDQN are given in the following.

(1) state: The agent observes the wireless environment by listening to the channel state \(c_p\) after taking action \(a_p\). The channel state is defined as \(c_p \in \{{\mathcal {I}}, {\mathcal {S}}, {\mathcal {R}}, {\mathcal {F}}\}\), where \({\mathcal {I}}\) means no transmission, \({\mathcal {S}}\) means exactly one transmission in the current TS, \({\mathcal {R}}\) means sharing between a D2D pair and a CU, and \({\mathcal {F}}\) means that D2D pairs reuse a TS and violate the QoS requirement of the CU due to severe interference. The observation space of the centralized agent includes the channel state \(c_p\), the channel power gains \(G_{B,k}^C, G_{B,i}^D, G_i^D, G_{k,i}^C\), and the noise variance \(\sigma ^2\) in (1) and (2). Thus, the environment state at step p can be expressed as \(s_p = (c_p, G_{B,k}^C, G_{B,i}^D, G_i^D, G_{k,i}^C, \sigma ^2)\).
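One possible way to flatten the observation \(s_p\) into a DDQN input vector is sketched below; the one-hot ordering of the channel-state symbols and the concatenation order are assumptions made for illustration.

```python
import numpy as np

# Channel-state symbols {I, S, R, F} from the text, one-hot encoded (assumed order)
CHANNEL_STATES = ("idle", "single", "reuse", "fail")

def encode_state(c_p, G_Bk_C, G_Bi_D, G_i_D, G_ki_C, sigma2):
    """Flatten s_p = (c_p, channel power gains, sigma^2) into one input vector."""
    one_hot = np.zeros(len(CHANNEL_STATES))
    one_hot[CHANNEL_STATES.index(c_p)] = 1.0
    gains = np.concatenate([np.ravel(G_Bk_C), np.ravel(G_Bi_D),
                            np.ravel(G_i_D), np.ravel(G_ki_C)])
    return np.concatenate([one_hot, gains, [sigma2]]).astype(np.float32)
```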

(2) action: The action set \({\mathcal {A}}=\{0, 1, \ldots ,I\}\) indicates which DT transmits at the current TS, where I is the number of D2D pairs. The action \(a_p = 0\) means no transmission from the D2D pairs, while \(a_p = i \ne 0\) means that DT i transmits at step p.

(3) reward: To ensure the SINR requirement of the CUs, i.e., that (7b)–(7e) are satisfied, the reward vector at time t is defined as \({\textbf{r}}[t]=[r_1^D[t],\ldots ,r_I^D[t],r_1^C[t],\ldots ,r_K^C[t]]^T\), where \(r_i^D[t]\) and \(r_k^C[t]\) denote the reward of the i-th D2D pair and of the k-th CU, respectively. They are defined as \(r_k^C[t] = C_k^C[t], \forall k\in {\mathcal {K}}\), and

$$\begin{aligned} {r_i^D[t]} = {\left\{ \begin{array}{ll} C_i^D[t],\forall i\in {\mathcal {I}}, &{}\quad {\text {if }} (7b)-(7e) \text { are satisfied,}\\ {0,}&{}\quad {\text {otherwise.}} \end{array}\right. } \end{aligned}$$
(14)
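The reward (14) translates directly into code; `constraints_ok` below stands for the feasibility check of (7b)–(7e) at TS t and is an assumed input.

```python
import numpy as np

def reward_vector(C_D, C_C, constraints_ok):
    """Build r[t] = [r_1^D, ..., r_I^D, r_1^C, ..., r_K^C] according to (14).

    C_D : (I,) D2D rates C_i^D[t];  C_C : (K,) CU rates C_k^C[t]
    """
    r_D = np.asarray(C_D) if constraints_ok else np.zeros(len(C_D))  # zero if infeasible
    r_C = np.asarray(C_C)                                            # r_k^C[t] = C_k^C[t]
    return np.concatenate([r_D, r_C])
```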

During the training phase, each epoch contains several training steps in which the agent interacts with the environment and stores the experience in the training data set. The pseudocode of the proposed DDQN scheme is shown in Algorithm 1.

Algorithm 1
figure d

DDQN for Resource Allocation
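Since Algorithm 1 is only reproduced here as a figure, the following skeleton gives an assumed reconstruction of the centralized training loop. The environment interface (`env.reset`, `env.step`, `env.num_actions`), the scalar reward (taken here as the sum of the reward vector (14), i.e., the instantaneous throughput), and the helper `ddqn_target` sketched in Sect. 3.2 are illustrative assumptions rather than the authors' exact implementation.

```python
import random
from collections import deque
import numpy as np
import torch

def train_ddqn(env, q_net, q_target_net, optimizer, n_epochs, steps_per_epoch,
               batch_size=30, beta=0.95, target_period=10,
               eps=1.0, eps_min=0.005, eps_step=0.005):
    """Skeleton of the centralized DDQN training loop (assumed reconstruction)."""
    replay = deque(maxlen=10_000)          # training data set D
    s = env.reset()
    for epoch in range(n_epochs):
        for _ in range(steps_per_epoch):
            # epsilon-greedy action: 0 = no D2D transmission, i = DT i transmits
            if random.random() < eps:
                a = random.randrange(env.num_actions)
            else:
                with torch.no_grad():
                    a = int(q_net(torch.as_tensor(s).unsqueeze(0)).argmax())
            s_next, r = env.step(a)        # listen to the channel, collect reward
            replay.append((s, a, r, s_next))
            s = s_next
            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                s_b, a_b, r_b, sn_b = (torch.as_tensor(np.array(x))
                                       for x in zip(*batch))
                q_sa = q_net(s_b.float()).gather(
                    1, a_b.long().unsqueeze(1)).squeeze(1)
                target = ddqn_target(q_net, q_target_net,
                                     r_b.float(), sn_b.float(), beta)
                loss = torch.nn.functional.mse_loss(q_sa, target)
                optimizer.zero_grad(); loss.backward(); optimizer.step()
            eps = max(eps_min, eps - eps_step)
        if epoch % target_period == 0:     # periodic target-network update (period C)
            q_target_net.load_state_dict(q_net.state_dict())
```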

4 Simulation results

This section demonstrates the performance of the proposed algorithm and compares it with four benchmarks: the overlay approach, the \(D^4SA\) algorithm in [9], the IF algorithm of our previous work [12], and the exhaustive search method. Before describing the environment settings, we briefly discuss the considered baselines. In the overlay approach, CUs and D2D pairs exclusively use orthogonal resource blocks, so no resource sharing is allowed. The IF method in [12] maximizes the number of admissible D2D pairs by checking whether the SINR requirements of all accessed CUs and D2D users can still be met if a new D2D pair is admitted to the network. The exhaustive search method combinatorially selects the optimal pairing among all arrangements that meet the interference requirements of the network. In the considered environment, the locations of CUs and D2D pairs are randomly deployed in the cell coverage at each TS, and the probability that a CU lies in the "sharable" area is approximated as \(\frac{d_{th}^2}{R^2}\), where R denotes the cell radius. The simulation parameters are listed in Table 1.

Table 1 Simulation parameters

Each DQN of the proposed DDQN consists of a 5-layer fully connected neural network (FCNN) with 3 hidden layers, each containing 8KI neurons.Footnote 3 The rectified linear unit (ReLU), i.e., \(f(x)=\max (0,x)\), is used as the activation function. For the other parameters, the updating constant \(\delta\) in (12) is set to 0.01 and the discount factor \(\beta\) is 0.95. The exploration rate \(\epsilon\) of the \(\epsilon\)-greedy algorithm starts at 1 and decreases to 0.005 with a step size of 0.005. The update period C of the target DQN is 10. The batch size of experiences sampled from \({\mathcal {D}}\), \(N_E\), is set to 30.
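Under the stated hyperparameters, the Q-networks and optimizer can be instantiated as in the sketch below; the input dimension follows the illustrative state encoding of Sect. 3.2 and is therefore an assumption, not a specification from this letter.

```python
import torch
import torch.nn as nn

def build_qnn(state_dim, K, I):
    """5-layer FCNN: input, three hidden layers of 8*K*I ReLU units, output layer."""
    h = 8 * K * I
    return nn.Sequential(
        nn.Linear(state_dim, h), nn.ReLU(),
        nn.Linear(h, h), nn.ReLU(),
        nn.Linear(h, h), nn.ReLU(),
        nn.Linear(h, I + 1),               # one Q-value per action in {0, 1, ..., I}
    )

K, I = 4, 2                                # e.g., the setting of Figs. 2 and 3
state_dim = 4 + K + I + I + K * I + 1      # per the illustrative encoding above
q_net = build_qnn(state_dim, K, I)
q_target_net = build_qnn(state_dim, K, I)
q_target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.01)   # delta = 0.01 in (12)
```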

4.1 Loss function and throughput

In Fig. 2, the values of the loss function in (10) are compared for various transmission probabilities of the D2D users. The loss values converge in the training phase once the number of epochs exceeds 700, while the transmission probability has little effect. In Fig. 3, the achievable throughput of the five considered methods is shown for various D2D transmission probabilities. The four underlay methods all outperform the overlay counterpart. Furthermore, the exhaustive search method achieves the best performance at the cost of huge computational complexity, as given in Table 2. Our proposed DDQN outperforms \(D^4SA\) [9] and the overlay approach with the same order of computational complexity. The main reason is that throughput is directly used as the reward in our proposed DDQN method, rather than the number of accessed user links as in \(D^4SA\). It is noteworthy that \(D^4SA\) uses dozens of previous states as input history, whereas the proposed DDQN uses only the current state and thus needs far fewer inputs and neurons per layer.

Fig. 2
figure 2

Training performance comparison among various transmission probabilities of D2D users (\(K=4, I=2\))

Fig. 3
figure 3

Training throughput among various methods (\(K=4, I=2\))

Fig. 4
figure 4

Training throughput among various methods for different numbers of sharable CUs (\(I=4\))

In Fig. 4, the transmission probability is set to 0.5. The throughput gap between the proposed DDQN and \(D^4SA\) increases with the number of sharable CUs; the improvement of the proposed DDQN over \(D^4SA\) exceeds 30% when the number of sharable CUs is 6. This phenomenon reveals that the reward function plays a critical role in resource allocation with deep learning approaches.

4.2 Complexity analysis

The computational complexity of the exhaustive search method is of order \({\mathcal {O}}(K^I)\), while that of IF is of order \({\mathcal {O}}(K^2I)\) [12]. In contrast, the complexity of the proposed method in the training phase is dominated by the evaluation of \(q_{target}\) in (11), which is of order \({\mathcal {O}}(KI^2N_E)\) per epoch. The complexity of the proposed method in the testing phase is dominated by the matrix multiplications of the FCNN forward pass, which is of order \({\mathcal {O}}(KI)\). The complexity comparison is listed in Table 2, where \(N_{epo}\) and \(N_{prv}\) stand for the number of training epochs and the length of previous state information required in \(D^4SA\), respectively. Note that the computational complexity of the IF and exhaustive methods is only evaluated under “Online Testing”, since these two approaches require no training.

Table 2 Complexity comparison among various spectrum allocation schemes
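As a quick numerical illustration of how the orders in Table 2 scale, the following lines print the dominant terms for a few example values of K and I (chosen for illustration only):

```python
# Growth of the complexity orders for K sharable CUs and I D2D pairs
N_E = 30                                   # batch size used by the proposed method
for K, I in [(4, 2), (6, 4), (8, 6)]:
    print(f"K={K}, I={I}: exhaustive K^I = {K**I:>8}, IF K^2*I = {K**2 * I:>5}, "
          f"proposed training K*I^2*N_E = {K * I**2 * N_E:>6} per epoch, "
          f"proposed testing K*I = {K * I:>3}")
```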

5 Conclusion

In this letter, we aimed to maximize the overall throughput of the D2D pairs and the cellular users in a cellular network with underlaid D2D communications. Based on the DRL philosophy, a novel centralized double deep Q-network (DDQN) was proposed to solve the non-convex problem with low complexity. Moreover, by leveraging CSI-based information, the proposed algorithm outperforms other DQN-based and non-learning approaches in terms of achievable throughput, as shown by the simulation results. For future research, power control and more sophisticated sharing principles among D2D pairs and CUs can be included to enrich the communication environment.