
1 Introduction

Cognitive Wireless Sensor Networks (CWSNs) combine cognitive radio technology with Wireless Sensor Networks (WSNs) to address the scarcity of spectrum resources by allowing a large number of sensor nodes, acting as Secondary Users (SUs), to access spectrum licensed to Primary Users (PUs). Dynamic Spectrum Access (DSA) is one of the key technologies in CWSN: its task is to decide, based on spectrum sensing data from the cognitive sensor nodes, whether to access a vacant spectrum band licensed to a PU. Two issues must be addressed when using this technique: how to minimise interference to the PU while accessing and using the licensed spectrum, and how to avoid conflicts among SUs when multiple SUs try to access the same spectrum [1, 2].

Traditional optimization algorithms such as game theory, particle swarm optimization, and genetic algorithms have been used to address the DSA problem [3, 4]. Although these methods achieve spectrum reuse, their model design is complex, they easily get trapped in local optima, and they lack flexibility and adaptability. In contrast, Reinforcement Learning (RL) can adaptively learn optimal strategies in uncertain, dynamic, and complex environments without prior information, and it has therefore been applied to DSA in recent years.

In [5], a Q-learning based spectrum access algorithm is proposed to improve transmission performance through intelligent utilisation of spectrum resources. The authors of [6] propose a decentralised multi-agent reinforcement learning resource allocation scheme that works without complete channel state information. The Q-learning used in [5, 6] performs well on small-scale models but degrades significantly when the state or action space is large. Deep Reinforcement Learning (DRL) overcomes this limitation by using Deep Neural Networks (DNNs). In [7], a centralised dynamic multichannel access framework based on the Deep Q-Network (DQN) is proposed to minimise conflicts and optimise multi-user channel allocation through a centralised allocation policy. However, centralised spectrum access incurs high communication overhead and may be difficult to implement in practice, and the algorithm's performance may be limited because it does not account for the imperfect spectrum sensing that occurs in real-world environments. The works in [8, 9] apply multi-agent deep reinforcement learning at the medium access control layer for channel access: users make transmission decisions through centralised training and decentralised execution to maximise the long-term average rate or to improve network throughput, delay, and jitter. However, centralised training introduces a single point of failure and requires substantial communication and computational resources, while decentralised execution requires the transmission and synchronisation of parameters. In [10], a new DSA method for multichannel wireless networks is proposed that finds near-optimal policies in fewer iterations and applies to a wide range of communication environments, but it targets only one DSA user and does not consider collisions between SUs and PUs. The authors of [11, 12] employ reservoir computing, or echo state networks, a type of Recurrent Neural Network (RNN), in DRL to enable distributed dynamic spectrum access for multiple users. They exploit the temporal correlation captured by RNNs to mitigate the effect of spectrum sensing errors and thereby reduce conflicts among users. Nonetheless, the Q-networks used are complicated and the convergence speed of the algorithm needs improvement.

2 System Model and Problem Formulation

We consider a multi-user, multi-channel CWSN environment with N PUs and M SUs. Figure 1 depicts the desired and interfering links when PU1, SU1, and SU2 operate on the same channel. The received signal of SUi on channel m is:

$$ y_{i}^{m} = x_{i}^{m} \cdot h_{ii}^{m} + x_{m}^{m} \cdot h_{mi}^{m} + \sum\limits_{j \in \Phi_{m}, j \ne i} x_{j}^{m} \cdot h_{ji}^{m} + z_{i}^{m} $$
(1)
Fig. 1

System model: desired, sensing, and interference links among the PU and SU transmitter-receiver pairs sharing the same channels

where \(x_{i}^{m}\) denotes the desired signal of SUi on channel m, while \(x_{m}^{m}\) and \(x_{j}^{m}\) denote the interfering signals from PUm and SUj, respectively. Similarly, \(h_{ii}^{m}\), \(h_{mi}^{m}\), and \(h_{ji}^{m}\) denote the channel gains from the transmitters of SUi, PUm, and SUj to the receiver of SUi, respectively. Additionally, \(z_{i}^{m}\) denotes additive white Gaussian noise (AWGN). The corresponding signal to interference plus noise ratio (SINR) is:

$$ SINR_{i}^{m} = \frac{p_{i}^{m} \cdot \left| h_{ii}^{m} \right|^{2}}{p_{m}^{m} \cdot \left| h_{mi}^{m} \right|^{2} + \sum\nolimits_{j \in \Phi_{m}, j \ne i} p_{j}^{m} \cdot \left| h_{ji}^{m} \right|^{2} + B \cdot N_{0}} $$
(2)

where \(p_{i}^{m}\), \(p_{m}^{m}\), and \(p_{j}^{m}\) denote the transmit powers of SUi, PUm, and SUj on channel m, respectively. B and N0 are the channel bandwidth and the noise power spectral density, respectively. The transmission rate \(C_{i}\) achieved at the receiver of SUi is:

$$ C_{i} = {\text{log}}_{2} \left( {1 + SINR_{i} } \right) $$
(3)

Equations (2) and (3) show that it is optimal for only one SU to transmit on a vacant channel.
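
To make Eqs. (1)-(3) concrete, the following Python sketch computes the SINR and the corresponding transmission rate of one SU on a channel. The function names and all numeric values are illustrative placeholders, not parameters from Table 1.

```python
import numpy as np

def sinr(p_i, h_ii, p_pu, h_pui, p_others, h_others, bandwidth, n0):
    """SINR of SU i on channel m, following Eq. (2).

    p_i, h_ii     : transmit power and channel gain of the desired SU link
    p_pu, h_pui   : transmit power and gain of the interfering PU link
    p_others,
    h_others      : arrays of powers/gains of the other SUs on the same channel
    bandwidth, n0 : channel bandwidth (Hz) and noise spectral density (W/Hz)
    """
    signal = p_i * abs(h_ii) ** 2
    interference = p_pu * abs(h_pui) ** 2 + np.sum(p_others * np.abs(h_others) ** 2)
    return signal / (interference + bandwidth * n0)

def rate(sinr_value):
    """Transmission rate of Eq. (3)."""
    return np.log2(1.0 + sinr_value)

# Illustrative values only (not taken from the paper's Table 1).
s = sinr(p_i=0.1, h_ii=0.8, p_pu=0.2, h_pui=0.05,
         p_others=np.array([0.1]), h_others=np.array([0.03]),
         bandwidth=1e6, n0=1e-17)
print(rate(s))
```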

We divide the spectrum holes of the licensed channels into multiple time slots. The channel occupancy is modelled as a two-state Markov chain, as shown in Fig. 2, where 0 represents an occupied channel and 1 represents a vacant channel. The transition probability matrix of the two-state Markov chain on the ith channel is:

$$ p_{i} = \left[ \begin{array}{cc} p_{00}^{i} & p_{01}^{i} \\ p_{10}^{i} & p_{11}^{i} \end{array} \right] $$
(4)
Fig. 2

Two-state Markov chain: transitions \(p_{00}\), \(p_{01}\), \(p_{10}\), and \(p_{11}\) between the occupied state 0 and the vacant state 1

where \( p_{xy} = \Pr \left\{ \text{the next state is } y \mid \text{the current state is } x \right\}, \left( x, y \in \left\{ 0, 1 \right\} \right)\).
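
The channel occupancy model of Eq. (4) and Fig. 2 can be simulated with the minimal sketch below, assuming the convention stated above (0 = occupied, 1 = vacant); the transition matrix values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition matrix of one channel (Eq. (4)): rows index the current state x,
# columns index the next state y, so P[x, y] = Pr{next state is y | current state is x}.
P = np.array([[0.8, 0.2],    # illustrative p00, p01
              [0.1, 0.9]])   # illustrative p10, p11

def next_state(current, P, rng):
    """Draw the next channel state (0 = occupied, 1 = vacant)."""
    return rng.choice(2, p=P[current])

# Simulate the occupancy of one channel over a number of time slots.
state, trace = 1, []
for _ in range(10):
    state = next_state(state, P, rng)
    trace.append(int(state))
print(trace)
```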

2.1 State

At the beginning of each time slot, SUi senses the N channels to obtain their states. The states of the channels in the t-th time slot are expressed as:

$$ s_{i} = \left[ {s_{i}^{1} ,s_{i}^{2} , \cdots ,s_{i}^{N} } \right] $$
(5)

where \(s_{i}^{n} = 1\) indicates that channel n is vacant and \(s_{i}^{n} = 0\) indicates that it is occupied. Since the spectrum detector is not perfect, the sensed channel states may contain errors. We define the probability of a sensing error for SUi on channel n as \(P_{i}^{n}\). The probability that SUi observes channel n as vacant is therefore:

$$ \Pr\left( o_{i}^{n} = 1 \right) = s_{i}^{n} \cdot \left( 1 - P_{i}^{n} \right) + \left( 1 - s_{i}^{n} \right) \cdot P_{i}^{n} $$
(6)

The SU does not know whether a spectrum sensing error has occurred. Consequently, the observed results are used as the historical channel state data in this paper. The sensing results obtained by SUi in the presence of possible spectrum sensing errors are denoted as:

$$ o_{i} = \left[ {o_{i}^{1} ,o_{i}^{2} , \cdots ,o_{i}^{N} } \right] $$
(7)
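
A minimal sketch of the observation model in Eqs. (6) and (7): each true channel state is flipped independently with the sensing error probability \(P_{i}^{n}\). The error probabilities and the number of channels used here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def observe(true_states, error_probs, rng):
    """Return the noisy observation vector o_i of Eq. (7).

    true_states : array of 0/1 true channel states s_i (Eq. (5))
    error_probs : per-channel sensing error probabilities P_i^n
    """
    flips = rng.random(true_states.shape) < error_probs
    return np.where(flips, 1 - true_states, true_states)

s_i = np.array([1, 0, 1, 1, 0, 0])   # true states of N = 6 channels (illustrative)
P_i = np.full(6, 0.1)                # illustrative 10% sensing error per channel
o_i = observe(s_i, P_i, rng)
print(o_i)
```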

2.2 Action

After spectrum sensing, each SU decides whether to access a channel based on its sensing result. The action of SUi is denoted by \(a_{i} \in \left\{ 0, 1, \cdots, N \right\}\), where \(a_{i} = n\) \(\left( n > 0 \right)\) indicates that SUi transmits on the nth channel in time slot t, while \(a_{i} = 0\) indicates that SUi does not transmit. The joint action of the M SUs is denoted as:

$$ A = \left\{ a_{1}, a_{2}, \cdots, a_{M} \right\} $$
(8)

2.3 Reward

SUs receive rewards based on the actions they take. When accessing a channel, an SU should avoid collisions with other SUs and interference with the PU while maximizing its own transmission rate. The reward function is defined as:

$$ r_{i} = \left\{ \begin{array}{ll} - C, & \text{collision with PU} \\ 0, & \text{no channel access} \\ \log_{2}\left( 1 + SINR_{i} \right), & \text{successful access} \end{array} \right. $$
(9)

Specifically, the reward is set to − C (C > 0) when the SU collides with the PU, and 0 when the SU does not transmit data. Otherwise, the SU's reward is the transmission rate of its receiver.
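
The reward of Eq. (9) can be transcribed directly as follows; the function name is ours, and the default penalty C = 2 mirrors the value used in the simulations of Sect. 4.

```python
import numpy as np

def reward(action, pu_active, sinr_value, penalty=2.0):
    """Reward of SU i in one time slot, following Eq. (9).

    action     : 0 means no transmission, n > 0 means transmit on channel n
    pu_active  : True if the PU occupies the chosen channel (collision)
    sinr_value : SINR at the SU receiver if the access succeeds
    penalty    : the constant C > 0 applied on a collision with the PU
    """
    if action == 0:
        return 0.0                     # no channel access
    if pu_active:
        return -penalty                # collision with the PU
    return np.log2(1.0 + sinr_value)   # successful access: transmission rate
```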

2.4 Policy

SUs know neither the channel state transition probabilities nor the sensing error probabilities, so they use these rewards to learn an access policy that maximizes their cumulative discounted return, which can be expressed as:

$$ R_{i} = \sum\limits_{t = 1}^{\infty} \gamma^{t - 1} r_{i}\left( t + 1 \right) $$
(10)

where \(\gamma \in \left[ 0, 1 \right]\) is the discount factor.

In summary, the ultimate goal of DSA is to maximise the cumulative discounted return in Eq. (10). The optimal policy \(\pi^{*}\) is obtained from the optimal Q value as follows:

$$ \pi^{*} = \mathop{\text{argmax}}\limits_{a_{i} \in A} Q_{\pi^{*}}\left( o_{i}, a_{i} \right) $$
(11)
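
A small sketch of the cumulative discounted return of Eq. (10) and the greedy policy of Eq. (11); `q_values` is a placeholder for the learned Q values over the N + 1 actions produced by the DQN of Sect. 3, and the numbers are illustrative.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted return R_i of Eq. (10) for a finite reward trace."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def greedy_action(q_values):
    """Optimal action of Eq. (11): argmax over the N + 1 actions {0, 1, ..., N}."""
    return int(np.argmax(q_values))

print(discounted_return([1.0, 0.0, -2.0, 3.5], gamma=0.9))
print(greedy_action(np.array([0.1, 2.3, 0.7, -1.0])))
```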

3 Proposed DRL Algorithm

Since the efficiency of Q-learning deteriorates as the state and action spaces grow, we use DNNs to approximate the Q function. The DQN architecture we use is shown in Fig. 3.

Fig. 3

The framework of DQN: experiences collected from the environment are stored in the experience pool and fed to the evaluation and target networks, and the loss function drives gradient descent on the evaluation network

In the training phase of the DQN, each SU acts as an agent and uses its observation at each time slot as the input to the evaluation network, which selects an action using the ε-greedy strategy. After SUi takes action \(a_{i}\), it receives a reward \(r_{i}\) from the environment and feeds the channel observation \(o_{i}^{\prime}\) of the next time slot into the target network to obtain the next action \(a_{i}^{\prime}\) and the target Q value \(\max_{a_{i}^{\prime}} Q\left( o_{i}^{\prime}, a_{i}^{\prime}; \theta^{\prime} \right)\). Each tuple \(\left( o_{i}, a_{i}, r_{i}, o_{i}^{\prime} \right)\) is an experience; experiences are collected with the ε-greedy strategy and stored in the experience pool before training starts. The accumulated experiences in the experience pool are used to calculate the loss during DQN training:

$$ loss = \left[ r_{i} + \gamma \mathop{\max}\limits_{a_{i}^{\prime}} Q_{t}^{i}\left( o_{i}^{\prime}, a_{i}^{\prime}; \theta^{\prime} \right) - Q_{e}^{i}\left( o_{i}, a_{i}; \theta \right) \right]^{2} $$
(12)

The parameters θ of the evaluation network are updated through back propagation of the calculated loss, and they are periodically copied to the target network to update its parameters θ′.
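
A minimal TensorFlow sketch of one training step of Eq. (12), including the periodic copy of the evaluation parameters to the target network. The helper names, the simple placeholder Q-network, the learning rate, and the discount factor are illustrative assumptions rather than the settings of Table 2; the ResNet structure actually used is sketched in Sect. 4.

```python
import tensorflow as tf

N = 6          # number of licensed channels (illustrative)
GAMMA = 0.9    # illustrative discount factor

def build_q_network(num_channels):
    """Placeholder Q-network; the ResNet variant of Sect. 4 is sketched there."""
    inputs = tf.keras.Input(shape=(num_channels,))
    hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
    outputs = tf.keras.layers.Dense(num_channels + 1)(hidden)  # actions {0, 1, ..., N}
    return tf.keras.Model(inputs, outputs)

eval_net = build_q_network(N)      # parameters theta
target_net = build_q_network(N)    # parameters theta'
target_net.set_weights(eval_net.get_weights())
optimizer = tf.keras.optimizers.Adam(1e-3)   # illustrative learning rate

def train_step(obs, actions, rewards, next_obs):
    """One gradient step on the squared TD error of Eq. (12)."""
    # Target: r + gamma * max_a' Q_t(o', a'; theta'), from the target network.
    q_next = target_net(next_obs)
    targets = rewards + GAMMA * tf.reduce_max(q_next, axis=1)
    with tf.GradientTape() as tape:
        q_eval = eval_net(obs)
        # Q_e(o, a; theta) of the action actually taken.
        q_taken = tf.reduce_sum(q_eval * tf.one_hot(actions, N + 1), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))
    grads = tape.gradient(loss, eval_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, eval_net.trainable_variables))
    return loss

# Illustrative mini-batch of experiences (o, a, r, o') drawn from the experience pool.
obs = tf.random.uniform((4, N))
actions = tf.constant([1, 0, 3, 2])
rewards = tf.constant([0.5, 0.0, -2.0, 1.2])
next_obs = tf.random.uniform((4, N))
print(float(train_step(obs, actions, rewards, next_obs)))

# Periodically copy theta to theta' to refresh the target network.
target_net.set_weights(eval_net.get_weights())
```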

4 Simulation Results

We conducted simulation experiments in an environment where 2 SUs coexist with 6 PUs, with their positions set randomly within a 150 m × 150 m area and the SUs placed 20–40 m from each other. We used the WINNER II model for the path loss and a Rician fading channel model. For each channel, we drew \(p_{11}\) from the uniform distribution on [0.7, 1] and \(p_{00}\) from the uniform distribution on [0, 0.3], and then set \(p_{10} = 1 - p_{11}\) and \(p_{01} = 1 - p_{00}\). The parameters of the system model are shown in Table 1.
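
The channel dynamics of this setup can be reproduced with the sketch below, which draws \(p_{11}\) and \(p_{00}\) from the stated uniform distributions and builds one transition matrix per PU channel; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2023)   # arbitrary seed
num_channels = 6                    # one licensed channel per PU

def sample_transition_matrices(num_channels, rng):
    """Per-channel transition matrices with p11 ~ U[0.7, 1] and p00 ~ U[0, 0.3]."""
    p11 = rng.uniform(0.7, 1.0, num_channels)
    p00 = rng.uniform(0.0, 0.3, num_channels)
    # Rows: current state (0 = occupied, 1 = vacant); columns: next state.
    return np.stack([np.array([[p00[i], 1 - p00[i]],
                               [1 - p11[i], p11[i]]]) for i in range(num_channels)])

print(sample_transition_matrices(num_channels, rng).shape)   # (6, 2, 2)
```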

Table 1 Parameters of system model

To improve training accuracy and counteract the performance degradation that deep neural networks suffer with increasing depth, we designed the DNN in our DQN as a ResNet structure with four hidden layers, as shown in Fig. 4. Each hidden layer contains 64 neurons with the Rectified Linear Unit (ReLU) activation function. To avoid settling on sub-optimal decision strategies before gaining sufficient learning experience, we used a decaying ε-greedy algorithm with the initial value of ε set to 1; at each time slot, ε was decayed according to ε ← max{0.995ε, 0.005}. The hyperparameters are provided in Table 2.
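
A TensorFlow sketch of this network and of the decaying ε-greedy rule; the exact placement of the single skip connection in Fig. 4 is an assumption (here it bypasses the two middle hidden layers), and the use of the Keras functional API and the function name are ours.

```python
import tensorflow as tf

def build_resnet_dqn(num_channels):
    """Four hidden layers of 64 ReLU units with one residual (skip) connection.

    The placement of the skip connection is an assumption based on Fig. 4:
    here it bypasses the two middle hidden layers.
    """
    inputs = tf.keras.Input(shape=(num_channels,))            # observation vector
    h1 = tf.keras.layers.Dense(64, activation="relu")(inputs)
    h2 = tf.keras.layers.Dense(64, activation="relu")(h1)
    h3 = tf.keras.layers.Dense(64, activation="relu")(h2)
    h4 = tf.keras.layers.Dense(64, activation="relu")(tf.keras.layers.Add()([h1, h3]))
    q_values = tf.keras.layers.Dense(num_channels + 1)(h4)    # one Q value per action
    return tf.keras.Model(inputs, q_values)

model = build_resnet_dqn(num_channels=6)
model.summary()

# Decaying epsilon-greedy schedule: epsilon <- max{0.995 * epsilon, 0.005}.
epsilon = 1.0
for _ in range(1000):
    epsilon = max(0.995 * epsilon, 0.005)
```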

Fig. 4

The structure of the deep neural network for the DQN algorithm: input s, four hidden layers of 64 units with a ResNet skip connection, and output Q

Table 2 Hyperparameters of DQN algorithm

We conducted simulations using Python and TensorFlow to evaluate the performance of our proposed algorithm DQN + MLP4 + ResNet against several other algorithms: myopic algorithm [13], DQN + RC [11], Q-learning, and DQN with only four fully connected layers (DQN + MLP4). We compared the algorithms based on their cumulative rewards, success rate, and conflicts with PUs and other SUs.

Our proposed algorithm demonstrates superior performance compared with the other algorithms, as shown in Figs. 5, 6, 7 and 8. Figure 5 shows that our algorithm achieves the highest average reward, while Fig. 6 shows that it achieves a much higher channel access success rate, reaching approximately 95%. Figure 7 shows that all learning-based algorithms, except for the myopic policy, eventually reach a zero collision rate with other SUs, indicating that they learn the access policies of other SUs by interacting with the environment. The myopic policy, in contrast, only accesses the channel that brings the maximum expected reward based on the known system channel information and cannot learn the access policies of other SUs. To discourage collisions with PUs, we set the collision reward to − 2, and as depicted in Fig. 8, our proposed algorithm achieves the lowest collision rate with PUs, even lower than the myopic policy.

Fig. 5

The average reward versus training steps for DQN + RC, Myopic, DQN + MLP4, DQN + MLP4 + ResNet, and Q-learning

Fig. 6

The average success rate versus training steps for DQN + RC, Myopic, DQN + MLP4, DQN + MLP4 + ResNet, and Q-learning

Fig. 7

The average collision rate with SUs versus training steps for DQN + RC, Myopic, DQN + MLP4, DQN + MLP4 + ResNet, and Q-learning

Fig. 8

The average collision rate with PUs versus training steps for DQN + RC, Myopic, DQN + MLP4, DQN + MLP4 + ResNet, and Q-learning

5 Conclusion

This study addresses the spectrum access problem in distributed DSA networks with spectrum sensing errors and proposes a DSA algorithm that combines DQN with ResNet. Simulation results demonstrate that the proposed DQN + MLP4 + ResNet algorithm enables SUs to learn the optimal channel access policy more efficiently, improves spectrum access opportunities, and effectively reduces inter-user collisions when the SUs have incomplete knowledge of the environment and are subject to sensing errors. In future work, we plan to consider more practical spectrum sharing scenarios and further improve the performance of the algorithm.