1 Introduction

Cognitive Radio (CR) has been proposed to overcome the spectrum scarcity problem. An unlicensed user, known as the Secondary User (SU), may opportunistically access the channel of the licensed user, known as the Primary User (PU), when the latter is absent [1]. Thus, one of the most important functions in CR is spectrum sensing (SS), which is responsible for verifying whether the primary channel is occupied or not. Several detectors have been proposed to perform the SS task, such as the energy detector (ED), the auto-correlation detector (ACD) and the cyclo-stationary detector (CSD) [2].

In classical SS, i.e. signal detection, the SU applies a test statistic (TS) to the received signal and compares it to a predefined threshold in order to make a decision on the PU status. If the TS is above the threshold, the PU is considered active. However, setting the optimal threshold that meets the target detection and false alarm rates presupposes that the statistical distribution of the TS is known, which is not always possible because the statistical properties of the noise, the PU signal or the transmission channel may be unstable or unknown.

To overcome the analytical statistical problems of classical SS and improve its performance, several published works propose adopting machine learning (ML) and neural network (NN) techniques to make decisions on the PU channel occupancy [3,4,5,6,7,8,9]. The main aim of these works is to tune ML or NN systems with the statistics of both hypotheses: \(H_0\), when the PU is assumed to be absent, and \(H_1\), when the PU is assumed to be active.

In [5], ML techniques such as K-means and the support-vector machine (SVM) are used to distinguish between the \(H_0\) and \(H_1\) hypotheses in cooperative SS. Two low-dimensional probability vectors of the ED, related to \(H_0\) and \(H_1\) respectively, are used to train the system. SVM is used to set the threshold curve between the \(H_0\) and \(H_1\) clusters. K-nearest-based ML is adopted in [10] for cooperative SS. The proposed mechanism is divided into two phases: training and classification. The global decision on the presence or absence of the PU, taken at the end of the classification phase, takes into consideration the reliability of each CR user when reporting to the fusion center during the training phase.

For local SS, an ensemble classifier is proposed in [11]. The classifier seeks to discriminate between the \(H_0\) and \(H_1\) hypotheses by being trained with the extracted cyclic features of the PU’s signal in low signal-to-noise ratio (SNR) conditions. This ensemble classifier is based on decision trees and the AdaBoost algorithm. Wideband SS is tackled in [12], where three ML techniques (neural networks, expectation maximization and k-means) are used to detect the presence of one or multiple primary users in a wideband spectrum.

In order to enhance the accuracy of the ML system in deciding on the PU status, hybrid SS (HSS) has been proposed [6, 7]. HSS consists of making a sensing decision based on several detectors instead of only one, as in classical SS. In [6, 7], artificial neural networks (ANNs) have been applied to perform HSS. The ANN is trained using the TSs of two detectors under \(H_0\) and \(H_1\) (ED and the cyclostationary detector (CSD) in [6], and ED and likelihood ratio statistics in [7]).

The strength of HSS lies in compensating for the weak points of a given detector with the advantages of another. For instance, ED suffers from noise uncertainty at low SNR, which is overcome by ACD. In return, ACD is adversely impacted by a low oversampling rate of the PU signal, while ED is not affected by this issue. An HSS scheme is proposed in [13], where ED and CSD are adopted. First, ED is evaluated to verify whether the primary user is present or not. The CSD is then used when the energy detector is not sure about the presence or absence of the PU. Moghimi et al. [14] and Cardenas-Juarez et al. [15] exploit the ED and the waveform detector (WFD), which is a coherent detector based on the correlation of the received PU signal with a known reference of this signal. An optimal hybrid detector based on ED and WFD is derived as a linear combination of an energy detection metric and a coherent correlation metric.

However, the classical treatment of HSS requires knowledge of some statistical features of the combined detectors. This may be hard to obtain since the PU signal’s statistical parameters are not always known or available. This makes numerical techniques such as NNs an efficient solution. Yet, even when NNs were used in the literature, the hybridization was limited to two detectors, as in [6, 7], which does not reflect the real potential of such a technique.

In this paper, we present a more general study of the performance of HSS by admitting up to six different detectors. ANNs are trained with the TSs of the detectors using data related to \(H_0\) and \(H_1\). The performance is discussed according to several criteria related to the ANN itself and to the number of detectors combined in HSS. Regarding the ANN system, we detail the effect of the number of layers and the number of nodes in each layer on the accuracy of the decision on the PU channel status. For the adopted detectors, the performance is evaluated based on the Probability of Detection (PD) and the False Alarm Rate (FAR). In addition, the impact of the number of combined detectors in HSS on the performance is detailed.

The remainder of this paper is organized as follows. In Sect. 2, our system model of the PU signal and the noise is presented. The data model, the neural network model, and the discrimination process between the two hypotheses \(H_0\) and \(H_1\) are given in Sect. 3. Numerical results and discussions are provided in Sect. 4. Finally, Sect. 5 concludes our work.

2 System Model

The decision in SS is binary: two hypotheses, \(H_0\) and \(H_1\), must be distinguished:

$$\begin{aligned} {\left\{ \begin{array}{ll} H_0:\text {PU is absent}\\ H_1: \text {PU is active} \end{array}\right. } \end{aligned}$$
(1)

The measured TS value leads SU to decide on the PU activity by comparing TS to a predefined threshold.

Accordingly, two classes of TS values have to be defined: the \(H_0\)-class and the \(H_1\)-class, related to the hypotheses \(H_0\) and \(H_1\) respectively. In fact, the \(H_0\)-class only depends on the system parameters such as the noise and the hardware imperfections. In other words, it is independent of the PU signal, since the received signal r(n) can be expressed as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} r(n)&{}=w(n) \text { under } H_0\\ r(n)&{}=s(n)+w(n)\text { under } H_1 \end{array}\right. } \end{aligned}$$
(2)

where w(n) stands for an additive white Gaussian noise (AWGN) and s(n) is assumed to be the received PU signal to be detected.
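To make Eq. 2 concrete, the following is a minimal sketch of how r(n) could be simulated under both hypotheses. It borrows the 16-QAM signal and the oversampling rate of 4 from the dataset description; the unit-power complex noise, the rectangular pulse shape and the names `awgn`, `qam16`, `received` and `power` are assumptions made for illustration, not the authors' generator.

```python
import random

def awgn(n, rng):
    """Return n samples of complex white Gaussian noise w(n) with unit power."""
    sigma = 0.5 ** 0.5  # per-component std so that E|w(n)|^2 = 1 (assumption)
    return [complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n)]

def qam16(n_symbols, ns, rng):
    """Return a unit-power 16-QAM sequence, each symbol repeated ns times."""
    levels = (-3, -1, 1, 3)
    scale = 10 ** 0.5  # the unnormalised constellation has average power 10
    samples = []
    for _ in range(n_symbols):
        symbol = complex(rng.choice(levels), rng.choice(levels)) / scale
        samples.extend([symbol] * ns)  # rectangular pulse shape (assumption)
    return samples

def received(n_symbols, ns, snr_db, pu_active, rng):
    """Build r(n) as in Eq. 2: w(n) under H0, s(n) + w(n) under H1."""
    w = awgn(n_symbols * ns, rng)
    if not pu_active:
        return w
    gain = (10 ** (snr_db / 10)) ** 0.5  # amplitude gain giving the target SNR
    s = qam16(n_symbols, ns, rng)
    return [gain * a + b for a, b in zip(s, w)]

def power(r):
    """Average power of the samples of r(n)."""
    return sum(abs(x) ** 2 for x in r) / len(r)

rng = random.Random(0)
r_h0 = received(2000, 4, 0.0, False, rng)  # noise only
r_h1 = received(2000, 4, 0.0, True, rng)   # PU active at 0 dB SNR
```

With unit-power noise, the average power of `r_h0` stays close to 1 whatever the SNR, while that of `r_h1` grows with the SNR, which is the property the \(H_0\)-class argument above relies on.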

For HSS, the SU evaluates an \(m\times 1\) vector V related to m detectors:

$$\begin{aligned} V=[T_{D_1}, T_{D_2}, \ldots , T_{D_m}]^{tr} \end{aligned}$$
(3)

where the superscript tr stands for the transpose operation and \(T_{D_i}\) is the TS related to the detector \(D_i,\ i\in [1,m]\). Each TS is a mathematical function applied to r(n). For instance, ED evaluates the sum of squares of the samples of r(n), whereas ACD evaluates the correlation between r(n) and a shifted version of itself, and so on. In classical SS, the SU evaluates only one TS related to a given detector. This TS is compared to a threshold to make a decision on the PU status. In HSS, however, a vector of TSs related to several detectors is evaluated and combined in order to examine the PU channel status. In our work, an ANN is used to combine the data of these detectors and exploit them to produce a final decision on the PU status.
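As an illustration of Eq. 3, the sketch below builds a two-entry TS vector from ED and ACD and shows the classical single-detector threshold test. The exact test statistics are those of the paper's Appendix; the lag-1, energy-normalised form of the autocorrelation statistic and all function names here are simplifying assumptions.

```python
def energy_statistic(r):
    """ED: sum of the squared magnitudes of the samples of r(n)."""
    return sum(abs(x) ** 2 for x in r)

def autocorrelation_statistic(r, lag=1):
    """ACD (assumed form): real part of the correlation between r(n) and
    r(n - lag), normalised by the signal energy."""
    num = sum((r[n] * r[n - lag].conjugate()).real for n in range(lag, len(r)))
    return num / energy_statistic(r)

def ts_vector(r):
    """V = [T_ED, T_ACD]^tr for m = 2 detectors (Eq. 3)."""
    return [energy_statistic(r), autocorrelation_statistic(r)]

def classical_decision(ts, threshold):
    """Classical SS: declare the PU active (1) when the TS exceeds the threshold."""
    return 1 if ts > threshold else 0

v_flat = ts_vector([1.0, 1.0, 1.0, 1.0])           # strongly correlated samples
v_alternating = ts_vector([1.0, -1.0, 1.0, -1.0])  # same energy, opposite correlation
```

The two toy signals have the same energy but opposite autocorrelation, which hints at why a second detector can separate cases that ED alone cannot.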

3 The Data Model

In this section, we present the details of our dataset and of the ANN model used to combine the evaluated TSs of the adopted detectors. Once the ANN system has been trained with hybrid data, we use it to make a decision on the PU status.

3.1 Dataset

The data consists of two categories according to the two hypotheses \(H_0\) and \(H_1\). The data was generated from the TSs of six detectors: ED [16], ACD [17], the maximum eigenvalue detector (EVM) [18], the maximum–minimum eigenvalue detector (EVMM) [18], the cumulative power spectral density detector (CPSD) [19] and the goodness-of-fit detector (GoF) [20]. The data assumes AWGN and a 16-QAM modulated PU signal with an oversampling rate \(N_s=4\). The TSs of the adopted detectors are given in the “Appendix”. Our dataset, as depicted in Fig. 1, consists of seven features, \(\{\text {ED, ACD, EVM, EVMM, CPSD, GoF, SNR}\}\), and a label. The label value is 0 under hypothesis \(H_0\) and 1 under \(H_1\). Figure 1 presents a description of the dataset. In particular, the dataset contains \(9 \times 10^6\) rows. Our choice to include the SNR in the set of features is important: it avoids having to build and train a separate neural network model (NN model) for each SNR value.

Fig. 1 Dataset description: \(9 \times 10^6\) rows; 7 features (6 detectors and SNR); the label has two possible values: 0 for hypothesis \(H_0\) and 1 for hypothesis \(H_1\). The mean, min, max, standard deviation and percentiles (\(25\%\), \(50\%\) and \(75\%\)) of the features and the label are also presented

We randomly split the dataset into an \(80\%\) training set and a \(20\%\) validation set. Figure 2 illustrates the count of rows with \(H_0\) and \(H_1\) respectively (i.e. labels 0 and 1) in the validation set. It can be observed that the data is uniformly distributed among all SNR values. This also applies to the training set.
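A random 80/20 split of this kind can be sketched as follows. The function name `split_rows` and the fixed seed are illustrative assumptions; the paper does not state how the shuffle was implemented.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and cut them into a training and a validation set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy usage with 1000 row indices instead of the full dataset.
train, valid = split_rows(range(1000))
```

Because the cut is made after shuffling, both subsets are expected to inherit roughly the same label and SNR proportions as the full dataset.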

Fig. 2 Histogram of Dataset to show the distribution of the data over the hypotheses \(H_0\) and \(H_1\). These two hypotheses are uniformly considered in our simulations with respect to various SNR values

To look at the data in more depth, we picked 1000 random samples from the validation dataset and plotted the scatter of two detectors, ED and ACD, as depicted in Fig. 3. The \(H_1\) data drifts away from the \(H_0\) data class as the SNR increases. The \(H_0\) data keeps the same place in the scatter space for all SNR values because it is only related to the noise. At low SNR values (e.g. − 21 dB), discriminating between \(H_0\) and \(H_1\) is a tough task because the \(H_0\) and \(H_1\) data are heavily mixed (see Fig. 3). At a relatively good SNR value (e.g. 6 dB), however, the classification becomes an easy task.

Fig. 3 The scattering of (\(\xi ,\alpha\)) for \(\hbox {N}=1500\) samples, 10,000 trials and different values of SNR

The data input of the model is a batch of 64 rows (see Sect. 4.1 for a discussion on the batch size). Figure 4 illustrates the first 10 rows of a batch drawn randomly from the dataset.

Fig. 4 10 rows from a batch

We iterate over the training set by selecting a batch at each step and feeding it as input to Algorithm 1. After completing a whole pass over the training set, we switch to the validation set and apply Algorithm 2 in order to assess the accuracy of the model. This completes one epoch. This procedure can be repeated until an acceptable accuracy is reached (e.g. an accuracy value \(> 95\%\)).
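The epoch procedure described above can be sketched as a small driver loop. Here `train_step` and `evaluate` are hypothetical callables standing in for Algorithm 1 and Algorithm 2, whose contents are not reproduced; the `max_epochs` cap is an added safeguard, not something the paper specifies.

```python
def batches(rows, batch_size=64):
    """Yield consecutive batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def run_epoch(train_rows, valid_rows, train_step, evaluate, batch_size=64):
    """One epoch: a full pass over the training set, then one validation pass."""
    for batch in batches(train_rows, batch_size):
        train_step(batch)        # stand-in for Algorithm 1
    return evaluate(valid_rows)  # stand-in for Algorithm 2, returns the accuracy

def fit(train_rows, valid_rows, train_step, evaluate, target=0.95, max_epochs=10):
    """Repeat epochs until the accuracy exceeds the target or the cap is hit."""
    epochs_done, accuracy = 0, 0.0
    while epochs_done < max_epochs and accuracy <= target:
        accuracy = run_epoch(train_rows, valid_rows, train_step, evaluate)
        epochs_done += 1
    return epochs_done, accuracy
```

A usage example would pass the real training and validation rows together with functions that update the model and compute its accuracy.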

3.2 The Neural Network Model

Since our data is in tabular form, we select a fully connected neural network (FCNN). An FCNN consists of one input layer, several hidden layers and one output layer. The feature set is the input layer of our model. The output layer simply consists of two nodes because we are trying to predict whether a row of feature values belongs to hypothesis \(H_0\) or \(H_1\). That is, the values of the two output nodes are two probabilities that sum to one. It remains to specify the number of hidden layers, i.e. the ones between the input and output layers.

The number of hidden layers and the number of nodes in each layer are considered as hyper-parameters and can be tweaked. Two hidden layers are considered: the first with 1000 nodes and the second with 500 nodes. We discuss the tweaking of the model parameters in Sect. 4.1.

As a subtle point, notice that the SNR takes discrete values; hence it is considered a categorical variable, as opposed to the six detector variables, which are continuous. It is common practice to use an embedding [21] for a categorical variable since it improves the model accuracy. The embedding process is shown in Fig. 5. In this figure, we take a one-hot encoded vector [21] of the SNR concatenated with a bias (i.e. a real value which will be learnt by the NN), which yields a vector of length 10. This vector is mapped to a vector of length 6, called the embedding vector. The embedding vector dimension is a hyper-parameter and can be tweaked (Sect. 4.1). A bias is added because this is required by the embedding process. We concatenate this 6-D vector with the six detectors (Eq. 3) in order to produce the input layer of the FCNN (Fig. 5). Then, we add two hidden layers with [1000, 500] nodes and an output layer with 2 nodes.
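The construction of the 12-value input layer can be sketched as below. Several points are assumptions made only for illustration: the grid of 9 SNR values (− 24 dB to 0 dB in 3 dB steps), the reading of the bias as a constant 1 entry whose weight row is learnt, and the random weights, which stand in for trained embedding parameters.

```python
import random

SNR_GRID_DB = [-24, -21, -18, -15, -12, -9, -6, -3, 0]  # assumed grid of 9 values
EMBED_DIM = 6

def one_hot_with_bias(snr_db):
    """One-hot encode the SNR and append a constant bias entry (length 10)."""
    one_hot = [1.0 if s == snr_db else 0.0 for s in SNR_GRID_DB]
    return one_hot + [1.0]

def embed(snr_db, weights):
    """Map the length-10 vector to a length-6 embedding using a 10 x 6 matrix."""
    v = one_hot_with_bias(snr_db)
    return [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(EMBED_DIM)]

def input_layer(snr_db, detectors, weights):
    """Concatenate the SNR embedding with the six detector values (12 values)."""
    return embed(snr_db, weights) + list(detectors)

rng = random.Random(0)
# Untrained placeholder weights; in the model these are learnt parameters.
weights = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(10)]
detectors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # hypothetical TS values
x = input_layer(-12, detectors, weights)
```

Under this reading, the embedding of a given SNR is simply the matrix row selected by the one-hot entry plus the shared bias row.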

Fig. 5 The NN model architecture: on the left we see the embedding layer. The output of the embedding layer, which is a vector of 6 real values, is concatenated to the vector of 6 detectors (6 real values) in order to produce the input layer of the NN (a vector of 12 real values). Then we add sequentially: Hidden Layer 1 (a vector of 1000 real values), Hidden Layer 2 (a vector of 500 real values) and the output layer (a vector of 2 real values)

For the performance metrics, we select the binary negative log likelihood (NLL) loss function [22] because the type of our problem is binary classification.

A brief explanation of the NLL loss function is given hereinafter. Let us take a row of features from the dataset. The ground truth label (or target) of this row is 0 or 1 (e.g. the first row in Fig. 4 has a ground truth label \(=0\)). After SNR embedding and concatenation with the other features as explained before, we get a vector x of dimension (12, 1) (the input layer in Fig. 5). Call the output layer \(\hat{y} = [\hat{y_0}, \hat{y_1}]^{tr}\), where \(\hat{y_i}, i=0,1\) is the predicted probability of \(H_i\) and the superscript tr is the transpose operator. We encode the ground truth label using one-hot encoding [21]. That is, label 0 is encoded as the vector \(y=[1,0]^{tr}\) whereas label 1 is encoded as \(y=[0,1]^{tr}\). In other words, \(y=[y_0, y_1]^{tr}\), where \(y_0=1\) if label = 0 and \(y_0=0\) if label = 1. Note that \(y_1=1-y_0\). The binary NLL loss function for this row (e.g. row 1) is expressed as:

$$\begin{aligned} L_1=-y_0\log (\hat{y_0})-(1-y_0)\log (1-\hat{y_0}) \end{aligned}$$

For a batch of 64 rows, the loss function becomes:

$$\begin{aligned} L=\frac{1}{64}\sum _{n=1}^{64}\left[ -y_{n0}\log (\hat{y}_{n0})-(1-y_{n0})\log (1-\hat{y}_{n0})\right] \end{aligned}$$
(4)

where \(y_{n0}\) (resp. \(\hat{y}_{n0}\)) is the encoded label value (resp. predicted probability) of row n of the batch. During the training phase (Algorithm 1), the model tries to minimize the loss function. During the validation phase (Algorithm 2), the loss is also calculated. In addition, we compute the confusion matrix and derive the model accuracy from it. Furthermore, we obtain two other important metrics, the detection probability and the false alarm rate, which are also derived from the confusion matrix. An example is given in Fig. 6, where the True Positive count is \(TP=898{,}399\), the False Positive count is \(FP=62{,}934\), the False Negative count is \(FN=2246\) and the True Negative count is \(TN=836{,}421\). Hence, we get the accuracy as \(\frac{TP+TN}{P+N}=0.9637\) (\(P+N\) is the total count of the validation set, which is 1,800,000). Consequently, the Detection Probability PD can be evaluated as \(PD=\frac{TP}{TP+FP}=0.9345\) and the False Alarm Rate FAR as \(FAR=\frac{FN}{FN+TN}=0.002678\).
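The per-row loss and its batch average (Eq. 4) can be sketched in a few lines. The function names are illustrative, and the sketch assumes that the predicted probabilities lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def row_loss(y0, y0_hat):
    """Binary NLL for one row; y0 is the encoded label, y0_hat the predicted
    probability of H0 (assumed to lie strictly between 0 and 1)."""
    return -y0 * math.log(y0_hat) - (1 - y0) * math.log(1 - y0_hat)

def batch_loss(y0s, y0_hats):
    """Average of the per-row losses over a batch (Eq. 4 for 64 rows)."""
    return sum(row_loss(y, p) for y, p in zip(y0s, y0_hats)) / len(y0s)

# Toy batch of two rows: confident correct predictions versus confident wrong ones.
good = batch_loss([1, 0], [0.9, 0.1])
bad = batch_loss([1, 0], [0.1, 0.9])
```

The loss shrinks as the predicted probability of the true hypothesis approaches one, which is what the training phase exploits when it minimizes L.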

Fig. 6 An example of the confusion matrix on the validation set showing the actual and predicted values, where 0 (resp. 1) represents \(H_0\) (resp. \(H_1\))
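The figures quoted from Fig. 6 can be reproduced with simple arithmetic. The sketch below uses the four counts and the ratios exactly as written in the text. For comparison, it also evaluates TP/(TP+FN) and FP/(FP+TN), which are the more common textbook forms of the detection probability and the false alarm rate. Which pair applies depends on how Fig. 6 labels its cells, and that cannot be settled from the text alone.

```python
# Counts quoted in the text for the validation set.
TP, FP, FN, TN = 898_399, 62_934, 2_246, 836_421

total = TP + FP + FN + TN
accuracy = (TP + TN) / total

# Ratios as written in the text.
pd_text = TP / (TP + FP)
far_text = FN / (FN + TN)

# Common textbook forms, shown only for comparison (not taken from the paper).
pd_alt = TP / (TP + FN)
far_alt = FP / (FP + TN)

print(f"accuracy={accuracy:.4f}  PD={pd_text:.4f}  FAR={far_text:.6f}")
print(f"comparison: TP/(TP+FN)={pd_alt:.4f}  FP/(FP+TN)={far_alt:.4f}")
```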

The details of the model are described in Fig. 7. First, an embedding layer is constructed as discussed before. Then, we apply a regularization technique called Dropout [23]. Dropout consists of randomly dropping a percentage of a layer’s nodes during the training process. This percentage is determined by the value p in Fig. 7. For the embedding layer, we set \(p=0\), which means that we do not drop any node since the number of nodes in this layer is too small (6 nodes). Normalization is also an important procedure in FCNNs; it is normally used to avoid the cases where the NN parameters vanish or explode. Batch normalization [24] is very efficient and hence we applied it to all the layers except the output. Equation 5 is the core operation in batch normalization.

$$\begin{aligned} y = \frac{x-E(x)}{\sqrt{Var(x)+\epsilon }}*\gamma +\beta \end{aligned}$$
(5)

Here, x represents a batch, E(x) and Var(x) are the mean and the variance of x respectively, \(\epsilon\) is added to ensure numerical stability, and \(\beta\) and \(\gamma\) (affine\(=True\)) are two learnable parameters. Also by default, during training this layer keeps running estimates of its computed mean and variance (track_running_stats\(=True\)), which are then used for normalization during evaluation. The running estimates are kept with a default momentum of 0.1. After normalization, a linear layer is added (Eq. 6):

$$\begin{aligned} y=W^{tr}.x + b \end{aligned}$$
(6)

where W is a learnable parameter matrix, x is the batch, . is the dot product and b is a learnable bias vector. For instance, the first linear layer connects the input layer (12 nodes) to the first hidden layer (1000 nodes), as shown in Fig. 5. With a batch size of 64, the dimensions of matrix W are (12, 1000), whereas the dimensions of x are (12, 64) and those of b are (1000, 64) (the bias vector of 1000 values repeated over the 64 rows of the batch).
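Equations 5 and 6 can be sketched in plain Python to check the arithmetic and the dimensions quoted above. The use of the population variance in the normalization, the zero-valued placeholder weights and the function names are assumptions; running statistics and parameter learning are left out.

```python
def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Eq. 5 for one feature across a batch (population variance assumed)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 * gamma + beta for v in values]

def linear(W, x, b):
    """Eq. 6, y = W^tr x + b, with W of shape (n_in, n_out), x of shape
    (n_in, batch) and b holding n_out values reused for every batch column."""
    n_in, n_out, batch = len(W), len(W[0]), len(x[0])
    return [[sum(W[i][j] * x[i][k] for i in range(n_in)) + b[j]
             for k in range(batch)]
            for j in range(n_out)]

normalised = batch_norm([1.0, 2.0, 3.0, 4.0])

# Dimension check for the first linear layer: 12 inputs, 1000 nodes, batch of 64.
W = [[0.0] * 1000 for _ in range(12)]  # placeholder weights
x = [[1.0] * 64 for _ in range(12)]    # placeholder batch
b = [0.5] * 1000                       # placeholder bias vector
y = linear(W, x, b)                    # shape (1000, 64)
```

The output `y` has one row per node of the first hidden layer and one column per row of the batch, in line with the dimensions given in the text.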

Fig. 7 The NN model details: we begin by an embedding layer transforming a list of 10 values [i.e. 9 SNR values and a bias (see Fig. 5)] to a vector of 6 real values. Then we apply batch normalisation to it and we concatenate it with the 6 detectors’ values. Then we apply sequentially two hidden layers and on each layer we apply ReLU, batch normalisation and dropout. Finally, we add the output layer

After adding the linear layer, we introduce a non-linearity by applying an activation function. In our case, it is the ReLU (Rectified Linear Units) function [25]. ReLU is simply max(0, y), to get rid of negative values.

As mentioned before, the model contains two phases: training and validation (see Algorithms 1 and 2). Note that the backward pass, in which the parameters of the model are updated in order to minimize the loss function, is applied during the training phase only. The validation phase, however, contains only a forward pass. Note also that Dropout is turned off during validation.
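The ReLU activation and the training-only behaviour of Dropout can be sketched as follows. Rescaling the kept values by 1/(1 − p) is a common convention assumed here; the text does not state whether the authors' implementation rescales.

```python
import random

def relu(values):
    """ReLU: replace negative values by zero."""
    return [max(0.0, v) for v in values]

def dropout(values, p, training, rng):
    """Zero each value with probability p during training and leave the
    input untouched during validation."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    # Kept values are rescaled by 1 / keep (assumed convention).
    return [v / keep if rng.random() >= p else 0.0 for v in values]

activations = relu([-1.0, 0.0, 2.0])
train_out = dropout([1.0] * 100, 0.5, True, random.Random(0))
valid_out = dropout([1.0] * 100, 0.5, False, random.Random(0))
```

With the small probabilities used for the hidden layers (0.001 and 0.01), only a handful of nodes are dropped per batch, and none at all during validation.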

Algorithm 1 Training phase
Algorithm 2 Validation phase

4 Results

4.1 Model Tweaking

We tested several model architectures with various numbers of layers and different numbers of nodes per layer.

Table 1 Accuracy as function of different model architectures

The results reported in Table 1 are obtained after one epoch of training, since the accuracy was almost independent of the number of epochs. We conducted our experiments on an AWS (Amazon Web Services) cloud machine equipped with a K80 GPU (12 GB integrated RAM; 5.6 TFLOPS [27]). It is clear that increasing the number of layers and the number of nodes per layer leads to better accuracy. However, we did not notice an accuracy improvement with more than two layers. Also, we increased the number of nodes to the maximum value allowed by the machine RAM. In addition to the number of layers and the number of nodes, there are other hyperparameters to tweak. The most important one is the learning rate. We applied the methodology suggested in [28] in order to select a learning rate which minimizes the loss function. The result is illustrated in Fig. 8. We obtained this figure by applying Algorithm 1 to a small percentage of the training set (5% in our case).

Fig. 8 Selection of the learning rate

According to [28], the learning rate should be selected from the decreasing zone in Fig. 8, that is, in the range \([10^{-5}, 10^{-1}]\). In our experiments we used the value \(10^{-5}\).

Other parameters are: batch size, momentum, epsilon, dropout probability and the length of the embedding vector.

For the batch size, we selected a value of 64 (a larger value can be used but this requires more RAM). For the embedding vector length, the best practice [21] is to reduce the dimension of the categorical input vector (the SNR vector in Fig. 5). Hence, any value less than 9 is acceptable. In our experiments, we fixed this value to 6. For the remaining parameters, we used momentum = 0.1 [26], epsilon = \(10^{-5}\) (this should be a number close to 0 [24]) and dropout probability \(p = 0.001\) for hidden layer 1 and \(p=0.01\) for hidden layer 2 (p should be a small percentage of the layer’s nodes). With these parameters, we obtained a high accuracy value (0.96) for the model architecture with 2 layers and [1000, 500] nodes. Also, as illustrated in Fig. 9, the validation and training losses are very close, which indicates that our model does not over-fit, i.e. it generalizes well to data it was not trained on.

Fig. 9 The loss value as function of the processed batches

4.2 Sensing Performance Evaluation

In this section, we present the results obtained from our model. We focus on two performance measures: the probability of detection (PD) and the false alarm rate (FAR). Our dataset contains six detectors: ED, ACD, EVM, EVMM, CPSD and GoF. We could present results for any combination of these detectors; however, this would be time consuming. Instead, we take the following set of combinations: \(\{ED,\ ED-EVM,\ ED-EVM-GoF,\ ED-EVM-GoF-EVMM,\ ED-EVM-GoF-EVMM-CPSD,\) all detectors\(\}\). ED is common to all the adopted combinations because it is the classical detector in SS and is widely considered as the reference one.

Figure 10 shows the evolution of PD and FAR of the ANN-based HSS detector in terms of SNR for all the adopted combinations. Note that adopting ED alone reflects the classical case in which the ANN is trained and validated with only one detector; thus it can be considered as the non-HSS reference. For the combination \(ED-EVM\), PD increases from 0.6 at SNR = − 24 dB to a value greater than 0.95 at an SNR of − 12 dB. This evolution of PD is accompanied by a decrease of FAR from 0.06 at SNR = − 24 dB to a value less than 0.1 at − 12 dB. On the other hand, for the ANN-based ED (no HSS is adopted), PD increases from 0.65 to 0.85 over the SNR range [− 24 ; − 12] dB, while FAR presents very high values compared to \(ED-EVM\) over the same SNR range.

Furthermore, Fig. 10 shows that PD increases with the number of detectors used, whereas FAR decreases. When three detectors are used, i.e. \(ED-EVM-GoF\), PD reaches 0.92 at − 12 dB and FAR becomes less than 0.06 at the same SNR. These two performance indicators, PD and FAR, become respectively higher than 0.95 and less than 0.001 when six detectors are used. This reflects the efficiency of hybrid sensing in terms of both protecting the PU from interference (when PD is high) and exploiting the available spectrum resources (when FAR is low).

Moreover, even at a very low SNR, i.e. − 24 dB, PD is above 0.825 with a FAR less than 0.001, which reveals the robustness of such a hybrid detector in achieving good performance when the other techniques fail.

In Fig. 11, we present the average values of PD and FAR over all SNRs. The average can be interpreted as the robustness of the proposed technique with respect to SNR. In fact, the data corresponding to \(H_0\) is noise-only related and not impacted by the SNR; thus its scatter in the detector space remains stable independently of the SNR. On the other hand, the data under \(H_1\) is PU-signal dependent, and is therefore related to the SNR of the received PU signal. Hence, the performance analysis presenting the average PD and FAR gives us an in-depth view of the efficiency of the proposed technique in distinguishing between \(H_0\) and \(H_1\) over a wide range of SNR ([− 24 ; 0] dB). For the case where no HSS is used, i.e. only ED is used, the average PD is around 0.84 for an average FAR of 0.25, as shown in Fig. 11. In contrast, for HSS, the average PD increases and the average FAR decreases as the number of detectors used increases. An average PD higher than 0.93 is observed when more than 3 detectors are used, while an almost zero FAR is obtained.

Fig. 10 Evaluation of PD and FAR in terms of SNRs

Fig. 11 The average PD and FAR for the SNR range [− 24 ; 0] dB for the used combination in the proposed ANN-based HSS technique

5 Conclusion

In this paper, we presented a hybrid spectrum sensing (HSS) technique using an artificial neural network (ANN). Instead of using one detection method as in classical spectrum sensing, several test statistics (TSs) of several detectors are combined using the ANN. The ANN system is trained with the TSs of the used detectors for the noise-only case and for the case where the PU is active. The numerical results corroborate the efficiency of the proposed HSS compared to the non-hybrid detection technique, where the ANN is trained with the TS of only one detector. In addition, the results show that the detection outcome becomes more reliable as the number of detectors increases.