1 Introduction

Big data communications in networks with high connection density keep growing because of increasing demands from social networks [1, 2], augmented reality [3], etc., against the background of the Internet of Things. Spectral efficiency optimization can enhance the throughput and thus ease this problem. This motivates novel resource allocation designs in multiple input multiple output (MIMO) cognitive radio (CR) to address such high data rate applications. Cognitive radio is a technology for reusing the spectrum licensed to a primary radio (PR) [4]. In the underlay CR mode, PR and CR can transmit concurrently without the CR transmitters inducing harmful interference at the PR receivers, because the CR interference power at the PR is restricted below a predefined threshold [5, 6].

At the heart of today’s wireless networks, e.g., IEEE-based [7] and 3GPP-based [8], the combination of orthogonal frequency division multiplexing with multiple input multiple output technologies is a key feature for addressing physical (PHY) and medium access control (MAC) layer challenges. Consequently, the communication system possesses a large number of radio resources and degrees of freedom, whose management demands low-complexity algorithms to fulfill the requirements of emerging applications and use cases. This is a challenging aspect of conventional algorithms, especially for CR networks. For instance, the minimum mean square error (MMSE) algorithm [9] demands complex operations such as matrix inversion in every iteration. The interior point algorithm based on the barrier method [10] employs centering iterations in which a costly Newton step is executed. The iterative waterfilling algorithm [11] involves a singular value decomposition at each iteration. Lagrange algorithms [12, 13] solve closed-form expressions in terms of iteratively updated dual variables associated with the optimization constraints. Generally speaking, these iterative optimization algorithms are computationally expensive by nature. That makes their real-time implementation challenging, especially for large-scale wireless communication systems in which the algorithm must be executed periodically within time frames of milliseconds (because system parameters such as channel conditions and the number of users are highly dynamic).

On the other hand, machine learning has become a research trend in several signal processing fields of wireless communications, such as resource and interference management [14, 15], channel state information feedback and estimation [16, 17], antenna selection and beamforming design [18, 19] and others. The reason behind this is the potential computational efficiency of deep learning solvers in optimization problems.

Hence, this paper proposes an attention-based deep learning algorithm to address a novel power allocation design in CR MIMO systems. Our contributions are summarized as follows.

  1. We develop a fair per-antenna power allocation scheme for CR MIMO systems. The proposed scheme addresses the issue of opportunistic unfair power allocation: all users can obtain a basic quality of service (QoS) by setting lower bounds during the configuration.

  2. We propose an attention-based convolutional neural network (Att-CNN) method to implement the fair power allocation over antennas and users. We design two types of attention mechanisms, one based on the cross-channel \(h_0\) and one based on the direct channel \(h_k\), which help the whole neural network to focus on the relationships between \(h_0\) and \(h_k\), and inside \(h_k\), respectively. In addition, the CNN effectively reduces the floating point operations (FLOPs) and the number of network parameters.

  3. Besides the Att-CNN model, we also implement the proposed per-antenna power allocation paradigm using a fully connected neural network and an equal power allocation method. The main findings show that the Att-CNN outperforms these baselines.

Throughout the paper, \((.)^T\) and \((.)^H\) are the transpose and conjugate transpose, respectively. \(\Vert .\Vert \) denotes the 2-norm. \([{\varvec{X}}]_i\) denotes the ith column of the matrix \({\varvec{X}}\). For a real value x, \([x]^+=\max (x,0)\). \({\mathbb {R}}^{m\times n}\) denotes the space of real \(m\times n\) matrices.

The remainder of this paper is organized as follows: Sect. 2 presents the system model and the problem statement. Section 3 derives the proposed fair per-antenna power allocation by means of the Att-CNN. Section 4 presents the experimental results. Section 5 reviews the related work. Finally, Sect. 6 concludes the paper.

2 Problem statement

We consider a CR network that includes a CR base station (CR-BS) and coexists with a PR base station (PR-BS). The CR-BS is equipped with \(N_t\) antennas communicating with K single-antenna CR users (CUs), where \(N_t,K\in {\mathbb {Z}}^+\). The PR-BS is a single-antenna system serving a primary user. The antenna system of the CR-BS is a massive MIMO array. Mutual interference is assumed according to the underlay CR setting, in which the CR induces interference at the primary user but below a predefined threshold, while the PR-BS introduces interference at the CR users without restrictions. The channels between the CR-BS antenna i and CR user k, the CR-BS antenna i and the PR user, the PR-BS and the CR user k, and the PR-BS and the PR user are denoted as \(h_{k,i}\), \(h_{0,i}\), \(g_k\) and \(g_0\), respectively, where \( 1 \le k \le K\) and \(1\le i \le N_t\). The total transmit powers of the PR-BS and the CR-BS are denoted as \(P_{P\!R}\) and \(P_t\), respectively. The noise power and the threshold on the interference caused by the CR-BS at the PR user are denoted as \(\sigma ^2\) and \(I_{th}\), respectively. The system model is illustrated in Fig. 1, and all parameters are listed and explained in Table 1.

Fig. 1 System Model: MIMO CR network coexists with a PR network

Table 1 Definitions of system model parameters

Fig. 2 Heat map of antennas power allocation. It shows an example for unfair power allocation scheme that assigns all the power to a single CR user (8th index)

Fig. 3 Structure of the proposed attention-based convolutional neural network

In this paper, we intend to optimize the spectral efficiency of CR user k in the proposed underlay CR network, which can be written as follows:

$$\begin{aligned} S\!E^{k}_{C\!R}\!=\!\log _2\left( \!\!1\!\!+\!\!\dfrac{\parallel \sum _{i=1}^{N_t} h_{k,i}P_{k,i}^{1/2}\parallel ^2}{\sigma ^2\!+\!\sum \limits _{l\ne k}\!\parallel \sum _{i=1}^{N_t}h_{k,i}P_{l,i}^{1/2}\parallel ^2\!+\xi }\!\right) , \end{aligned}$$
(1)

where \(\xi =\parallel g_k \parallel ^2 P_{P\!R}\) is the interference from the PR-BS.
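To make Eq. (1) concrete, the following is a minimal sketch in plain Python of the per-user spectral efficiency. The function name `spectral_efficiency`, its argument names and the toy channel values are illustrative assumptions, not part of the original work.

```python
import math

def spectral_efficiency(h, P, k, sigma2, g_k, P_pr):
    """Spectral efficiency of CR user k following Eq. (1).

    h[k][i] : complex channel from CR-BS antenna i to CR user k
    P[l][i] : power allocated on antenna i to CR user l
    sigma2  : noise power, g_k : PR-BS to user k channel, P_pr : PR-BS power
    """
    n_t = len(h[k])

    def rx_power(l):
        # |sum_i h_{k,i} * sqrt(P_{l,i})|^2: signal meant for user l, seen at user k
        return abs(sum(h[k][i] * math.sqrt(P[l][i]) for i in range(n_t))) ** 2

    signal = rx_power(k)
    interference = sum(rx_power(l) for l in range(len(P)) if l != k)
    xi = abs(g_k) ** 2 * P_pr  # interference from the PR-BS
    return math.log2(1.0 + signal / (sigma2 + interference + xi))

# Toy example (hypothetical values): 2 users, 2 antennas, orthogonal channels.
h = [[1 + 0j, 0j], [0j, 1 + 0j]]
P = [[1.0, 0.0], [0.0, 1.0]]
se0 = spectral_efficiency(h, P, k=0, sigma2=1.0, g_k=0j, P_pr=0.0)
se0_pr = spectral_efficiency(h, P, k=0, sigma2=1.0, g_k=1 + 0j, P_pr=1.0)
```

In this toy case the two users do not interfere with each other, so the only degradation in `se0_pr` comes from the PR-BS term \(\xi\).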

Intuitively, the optimization problem of antennas power allocation can be stated as follows:

$$\begin{aligned}&S\!E_{C\!R}=\max _{\{P_{k,i}\ge 0 \}}\sum _{k=1}^{K}S\!E_{C\!R}^k, \nonumber \\&\quad s.t.: \nonumber \\&\quad C1: \qquad \quad \!\!\! \sum _{k=1}^{K}\sum _{i=1}^{N_t}P_{k,i} \le P_t,\nonumber \\&\quad C2: I_{C\!R} = \sum _{k=1}^{K}\parallel \sum _{i=1}^{N_t}h_{0,i}P_{k,i}^{1/2}\parallel ^2 \le I_{th}. \end{aligned}$$
(2)

Here, the optimization target is to maximize the SE of all cognitive users \(S\!E_{C\!R}\) in the CR network, under the constraints of total power \(P_t\) and interference threshold \(I_{th}\). However, when we addressed this optimization problem with tuned deep neural networks, we encountered a new problem. Figure 2 shows a heat map example of unfair power allocation. The horizontal and vertical axes are the user and antenna indices, respectively. It is clear that a single CU, or even only several antennas of that user, is allocated almost all the power. This is unfair to the other users, because all users have the same priority. A fair scheme exploits more spatial degrees of freedom in the MIMO system by serving more users, i.e., multiuser MIMO. In contrast, unfair schemes serve a single user and underutilize the spatial degrees of freedom, i.e., single-user MIMO, given that the antenna condition \(N_t \ge K\) holds [9, 12]. Thus, the benefit of fairness is twofold: better QoS and higher spectral efficiency. It is worth noting that the global maximum can only be found by an exhaustive search, a trial-and-error method whose computational complexity is intractable for real-time communications.

Therefore, we anchor the optimization problem to fair antennas power allocation and add a new constraint (i.e., user minimum power \(\lambda _k\)) as follows:

$$\begin{aligned}&S\!E_{C\!R}=\max _{\{P_{k,i}\ge 0 \}}\sum _{k=1}^{K}S\!E_{C\!R}^k, \nonumber \\&\quad s.t.: \nonumber \\&\quad C1: \qquad \quad \!\!\! \sum _{i=1}^{N_t}P_{k,i} \ge \lambda _k,\nonumber \\&\quad C2: \qquad \quad \!\!\! \sum _{k=1}^{K}\sum _{i=1}^{N_t}P_{k,i} \le P_t,\nonumber \\&\quad C3: I_{C\!R} = \sum _{k=1}^{K}\parallel \sum _{i=1}^{N_t}h_{0,i}P_{k,i}^{1/2}\parallel ^2 \le I_{th}, \end{aligned}$$
(3)

where \(\lambda _k\) is configurable. Note that if \(\lambda _k=0, \forall k \in [1,K]\), then it becomes the previous unfair antennas power allocation problem.
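The three constraints can be checked for a candidate allocation with a few lines of plain Python. This is a sketch under the assumption that C1 acts as a per-user lower bound, as the description of \(\lambda _k\) as the user minimum power suggests; the function name `check_constraints` and the toy numbers are hypothetical.

```python
import math

def check_constraints(P, h0, lam, P_t, I_th):
    """Return (C1 ok, C2 ok, C3 ok, I_CR) for an allocation P[k][i].

    Assumption: C1 is read as sum_i P[k][i] >= lam[k] (user minimum power).
    """
    per_user = [sum(row) for row in P]
    c1 = all(p_k >= lam_k for p_k, lam_k in zip(per_user, lam))
    c2 = sum(per_user) <= P_t
    # I_CR = sum_k |sum_i h_{0,i} * sqrt(P_{k,i})|^2
    i_cr = sum(
        abs(sum(h0[i] * math.sqrt(row[i]) for i in range(len(h0)))) ** 2
        for row in P
    )
    c3 = i_cr <= I_th
    return c1, c2, c3, i_cr

# Toy example (hypothetical values): 2 users, 2 antennas.
P = [[0.5, 0.5], [0.25, 0.25]]
h0 = [1.0, 1.0]
c1, c2, c3, i_cr = check_constraints(P, h0, lam=[0.1, 0.1], P_t=2.0, I_th=4.0)
# Setting one user's power to zero violates the per-user minimum.
c1_unfair, _, _, _ = check_constraints([[1.0, 0.5], [0.0, 0.0]], h0,
                                       lam=[0.1, 0.1], P_t=2.0, I_th=4.0)
```

With \(\lambda _k=0\) for all users the C1 check is always satisfied, which matches the remark that the problem then reduces to the unfair formulation.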

3 Attention-based deep neural network

3.1 Network structure

We use an attention-based convolutional neural network, i.e., Att-CNN, to address the above-mentioned problem. Because the channel gains obtained from our channel model are extremely small, we use the following equation to normalize the dataset.

$$\begin{aligned} {\hat{h}}_{i,j}\triangleq \dfrac{\log _{10}(h_{i,j})- {\mathbb {E}}[\log _{10}(h_{i,j})]}{\sqrt{{\mathbb {E}}[(\log _{10}(h_{i,j}) -{\mathbb {E}}[\log _{10}(h_{i,j})])^2]}}. \end{aligned}$$
(4)
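As an illustration of Eq. (4), the sketch below standardizes the base-10 logarithm of a list of gains. It assumes that the inputs are positive real values (for example, magnitudes) and that the expectation is replaced by the empirical mean over the data set; both are assumptions made for this example only.

```python
import math

def normalize_log10(gains):
    """Standardize log10 of positive gains: zero mean, unit variance (Eq. (4))."""
    logs = [math.log10(g) for g in gains]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    std = math.sqrt(var)
    return [(x - mean) / std for x in logs]

# Hypothetical, extremely small gains spanning two orders of magnitude.
gains = [1e-9, 1e-8, 1e-7]
normalized = normalize_log10(gains)
```

The logarithm compresses the dynamic range of the tiny gains, and the standardization then places them on a scale that is convenient for neural network training.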

To process the complex-valued channel gains, a data initialization network is used, see Fig. 3. We define the block channel matrix as \({\varvec{H}}\triangleq \left[ {\varvec{h}}_1^T,\cdots ,{\varvec{h}}_K^T\right] ^T\). The normalized channel gains of \({\varvec{H}}\) are fed into \(N_i\) fully connected layers (FCLs) with \(L_i\) cells to capture the relationship between the real and imaginary parts. Then a cross-channel \(h_0\) attention network accepts the data and outputs a new matrix called \(\hat{{\varvec{H}}}\), which mixes the real and imaginary parts of the channel gains and is adjusted by the cross-channel \({h}_0\). The cross-channel \({h}_0\) attention network is introduced in Sect. 3.2.1.

Then the integrated matrix is input into three networks, see Fig. 3. At the front of each network is a direct channel \(h_{k}\) attention network (HKAN), which is used to reevaluate every row in the integrated matrix, see Sect. 3.2.2.

In Network 1, the output of the HKAN is input into \(N_c\) CNN layers. The CNN can decrease the number of neural network parameters because of parameter sharing. Pooling layers are not considered since all data should be used. \(N_c\) is limited by the receptive field: the receptive field of the topmost layer should be no larger than the input image region [20]. The receptive field can be calculated as follows.

$$\begin{aligned} r_{n}\!=\!r_{n\!-\!1} * l_{n}-\left( l_{n}-1\right) *\left( r_{n\!-\!1}-\prod _{i=1}^{n-1} s_{i}\right) , s.t., n\!\ge \!2, \end{aligned}$$
(5)

where \(l_{n}\), \(s_{n}\) and \(r_{n}\) are the kernel size, stride and receptive field of the nth layer, respectively; \(r_{0}=1\); and \(r_{1}=l_{1}\). Figure 4 shows an example in which the input image size is 10*10 pixels, and the kernel size and stride of both the first and second layers are 4*4 pixels and 1*1 pixel, so that the receptive field of the second layer is 7*7 pixels. Finally, the receptive field of the third layer reaches 10*10 pixels, which covers the whole image. The output of the CNN layers is sent into \(N_{f1}\) FCLs with \(L_{f1}\) cells, followed by a softmax layer. We interpret the final output of Network 1 as \(\frac{P_{k,i}}{P_{k}}\), because this ratio has the same mathematical property as a softmax output, as follows.

Fig. 4 Receptive field

$$\begin{aligned} \dfrac{\sum \limits P_{k,i}}{P_k}=1. \end{aligned}$$
(6)
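The receptive-field recursion in Eq. (5) can be checked numerically against the 10*10 example of Fig. 4. The sketch below is a direct transcription of the recursion; the function name `receptive_fields` is an assumption of this example, and the third layer is assumed to use the same 4*4 kernel and unit stride as the first two.

```python
from math import prod

def receptive_fields(kernels, strides):
    """Receptive field r_n of each layer, following Eq. (5) with r_0 = 1."""
    r = [1]  # r_0
    for n in range(1, len(kernels) + 1):
        l_n = kernels[n - 1]
        jump = prod(strides[: n - 1])  # s_1 * ... * s_{n-1}; empty product is 1
        r_prev = r[-1]
        r.append(r_prev * l_n - (l_n - 1) * (r_prev - jump))
    return r[1:]

# Three layers with 4x4 kernels and stride 1 (per spatial dimension).
fields = receptive_fields([4, 4, 4], [1, 1, 1])
```

Expanding the right-hand side of Eq. (5) gives \(r_n = r_{n-1} + (l_n-1)\prod _{i=1}^{n-1}s_i\), so each additional stride-1 layer with a 4*4 kernel enlarges the receptive field by 3 pixels per dimension.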

In Network 2, the output of the HKAN is directly input into \(N_{f2}\) FCLs with \(L_{f2}\) cells. Then an FCL with sigmoid activation functions is used to produce the final output of Network 2. This output is denoted as \(\frac{P_{k}}{P_{k}^\prime }\), where \(P_{k}^\prime \) is the upper bound of \(P_{k}\), and \(\frac{P_{k}}{P_{k}^\prime }\) has the same range as the sigmoid function, as follows.

$$\begin{aligned} 0<\frac{P_{k}}{P_{k}^\prime }<1. \end{aligned}$$
(7)

Similar to Network 2, the output of the HKAN in Network 3 is input into \(N_{f3}\) FCLs with \(L_{f3}\) cells. The result is then fed into a softmax layer and a sigmoid layer. The softmax and sigmoid layers output \(\frac{{\tilde{P}}_{k}}{\sum {{\tilde{P}}_{k}}}\) and \(\frac{\sum {{\tilde{P}}_{k}}}{P_t-\sum {\lambda _k}}\), respectively, because these quantities have the same ranges as the softmax and sigmoid outputs, as follows.

$$\begin{aligned}&\sum \limits {\dfrac{{\tilde{P}}_{k}}{\sum {{\tilde{P}}_{k}}}} = 1, \end{aligned}$$
(8)
$$\begin{aligned}&0<\dfrac{\sum {{\tilde{P}}_k}}{P_t-\sum {\lambda _k}}<1, \end{aligned}$$
(9)

where \({\tilde{P}}_k\) is the upper bound on the power that can be allocated to user k in addition to \(\lambda _{k}\). Namely, \(P_k^\prime =\lambda _k+{\tilde{P}}_k\).

Then we can calculate \(P_{k,i}\) as follows:

$$\begin{aligned} P_{k,i} \!=\! \dfrac{P_{k,i}}{P_k}\!\dfrac{P_k}{P_k^\prime }\!\! \left[ \lambda _k\!+\!\dfrac{{\tilde{P}}_k}{\sum {{\tilde{P}}_k}} \dfrac{\sum {{\tilde{P}}_k}}{P_t\!-\!\sum {\lambda _k}}\left( P_t\!-\!\!\sum {\lambda _k}\right) \right] \!\!. \end{aligned}$$
(10)
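The composition in Eq. (10) can be sketched as follows. The four inputs stand for the outputs of the three networks (a per-user softmax over antennas, a per-user sigmoid, a softmax over users and a scalar sigmoid); the names `a`, `b`, `c`, `d` and `compose_power`, as well as the numbers, are hypothetical and chosen only for illustration.

```python
def compose_power(a, b, c, d, lam, P_t):
    """Combine the network outputs into P[k][i] following Eq. (10).

    a[k][i] : P_{k,i} / P_k         (softmax over antennas, Network 1)
    b[k]    : P_k / P'_k            (sigmoid, Network 2)
    c[k]    : P~_k / sum(P~)        (softmax over users, Network 3)
    d       : sum(P~) / (P_t - sum(lam))  (sigmoid, Network 3)
    """
    budget = P_t - sum(lam)  # power left after reserving every user's minimum
    P = []
    for k in range(len(lam)):
        p_upper = lam[k] + c[k] * d * budget  # P'_k = lambda_k + P~_k
        p_user = b[k] * p_upper               # P_k
        P.append([a_ki * p_user for a_ki in a[k]])
    return P

# Toy example (hypothetical values): 2 users, 2 antennas.
a = [[0.5, 0.5], [0.25, 0.75]]
b = [0.8, 0.5]
c = [0.6, 0.4]
d = 0.5
P = compose_power(a, b, c, d, lam=[0.1, 0.1], P_t=1.2)
total_power = sum(sum(row) for row in P)
```

Because `b[k]` and `d` lie strictly between 0 and 1 and `c` sums to 1, the total power produced this way stays below the budget \(P_t\) by construction.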

Thus, we obtain \(P_{k,i}\) which complies with the constraints C1 and C2. Then, for our fair antenna power allocation problem, the loss function is designed as follows:

$$\begin{aligned} Loss\! \triangleq \!-\!\sum \limits _{k=1}^{K}\log _2\left( \!\!1\!\!+\!\! \dfrac{\parallel \sum \limits _{i=1}^{N_t}h_{k,i}{\hat{P}}_{k,i}^{1/2} \parallel ^2}{\sigma ^2\!\!+\!\!\sum \limits _{l\ne k}\!\!\parallel \!\!\sum \limits _{i=1}^{N_t}h_{k,i}{\hat{P}}_{l,i}^{1/2} \!\!\parallel ^2\!\!+\xi }\!\right) , \end{aligned}$$
(11)

where \({\hat{P}}_{k,i}\) is calculated by Algorithm 1. Note that \({\hat{P}}_{k,i}=P_{k,i}/(1+[I_{C\!R}/I_{th}-1]^+)\), which penalizes \(P_{k,i}\) when the constraint C3 is violated. However, \({\hat{P}}_{k,i}\) can never be larger than \({P}_{k,i}\), so the scaled allocation may fail to meet the constraint C1 in some cases. Thus, in order to find the equilibrium between the constraints C1, C2 and C3, we propose the iterative Algorithm 1.
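The scaling rule \({\hat{P}}_{k,i}=P_{k,i}/(1+[I_{C\!R}/I_{th}-1]^+)\) can be sketched as below. This covers only the single scaling step, not the iterative Algorithm 1 itself, whose steps are not reproduced in the text; the function names and toy values are assumptions of this example.

```python
import math

def interference(P, h0):
    """I_CR = sum_k |sum_i h_{0,i} * sqrt(P_{k,i})|^2 (constraint C3)."""
    return sum(
        abs(sum(h0[i] * math.sqrt(row[i]) for i in range(len(h0)))) ** 2
        for row in P
    )

def penalize(P, h0, I_th):
    """Scale P down by 1 + [I_CR / I_th - 1]^+ ; a no-op if C3 already holds."""
    scale = 1.0 + max(interference(P, h0) / I_th - 1.0, 0.0)
    return [[p / scale for p in row] for row in P]

# Toy example (hypothetical values): one user, two antennas, I_CR = 4.
P = [[1.0, 1.0]]
h0 = [1.0, 1.0]
P_hat = penalize(P, h0, I_th=1.0)      # C3 violated, so the powers shrink
P_same = penalize(P, h0, I_th=10.0)    # C3 satisfied, so nothing changes
```

Since \(I_{C\!R}\) is linear in a common scaling of all powers, a violating allocation lands exactly on the threshold \(I_{th}\) after this step, which is why the per-user minimum of C1 may then need to be rechecked.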

Algorithm 1

3.2 Applications of attention mechanism

The attention mechanism is a technique that imitates the way human beings address a problem by focusing on the important information in a large amount of data. In our problem, because of the diversity of the channel gains \({\varvec{H}}\), the networks may learn spurious rules, which do not actually exist, from the limited training sets. The proposed attention mechanism is problem-specific and, owing to its unique structure, makes the networks more sensitive to the channel gains \({\varvec{H}}\) than fully connected networks. We define two kinds of attention mechanisms, \({\varvec{h}}_0\) and \({\varvec{h}}_k\), which are represented as two neural networks. The cross-channel \(h_0\) attention network takes into account the relationship between \({\varvec{h}}_0\) and \({\varvec{h}}_k\), and the direct channel \(h_k\) attention network represents the internal relationship of \({\varvec{h}}_k\). The details are reported as follows.

Fig. 5 a The cross-channel \(h_0\) attention network. b The direct channel \(h_k\) attention network

3.2.1 The cross-channel \(h_0\) attention network

As shown in Eq. (3), the channel gain between CR antenna i and the PR user \({\varvec{h}}_{0}\) can influence the value of \(I_{C\!R}\), which is included in constraint C3. Hence, the cross-channel \(h_0\) attention mechanism is involved in the initialization network. In other words, we reevaluate the weights of CR channel gains in terms of \({\varvec{h}}_0\). Figure 5a shows the cross-channel \(h_{0}\) attention network. Here, \({\varvec{h}}_0\) is used to produce a matrix \({\varvec{q}}\in {\mathbb {R}}^{{\bar{K}} \times N_t}\), \({\bar{K}}= K/2\) as follows:

$$\begin{aligned} {\varvec{q}}=\sigma \left( {\varvec{W}}^{q}{\varvec{h}}_{0}\right) , \end{aligned}$$
(12)

where \({\varvec{W}}^{q}\) is a \({\bar{K}}\times 1\) weight vector from training, and \(\sigma \) is an activation function. Note that \({\bar{K}}= K/2\) is proposed for reducing computation. In addition, matrix \({\varvec{f}}\in {\mathbb {R}}^{{\bar{K}} \times N_t}\) is taken from CR channel gain matrix \({\varvec{H}}\) as follows:

$$\begin{aligned} {\varvec{f}}=\sigma \left( {\varvec{W}}^{f}{\varvec{H}}\right) , \end{aligned}$$
(13)

where \({\varvec{W}}^{f}\) is the \({\bar{K}}\times K\) parameter matrix. Then, matrix \({\varvec{\alpha }}\in {\mathbb {R}}^{N_t \times N_t}\) is defined as the extent to which the network attends to \({\varvec{q}}\) when adjusting \({\varvec{f}}\). \(\alpha _{i,j}\) can be obtained as follows:

$$\begin{aligned} \alpha _{i, j}\triangleq \frac{\exp \left( score_{i j}\right) }{\sum _{i=1}^{N_t} \exp \left( score_{i j}\right) }, \end{aligned}$$
(14)

where \({\varvec{score}}={\varvec{q}}^{\mathrm {T}}{\varvec{f}}\in {\mathbb {R}}^{N_t \times N_t}\). Then, we can obtain a new channel gain matrix with the cross-channel \(h_0\) attention mechanism written as \({\hat{\varvec{H}}}\triangleq \left[ {\hat{\varvec{h}}}_{1}^T, {\hat{\varvec{h}}}_{2}^T,\cdots ,{\hat{\varvec{h}}}_{j}^T,\cdots , {\hat{\varvec{h}}}_{K}^T\right] ^T\in {\mathbb {R}}^{K \times N_t}\), where

$$\begin{aligned} {\hat{{\varvec{h}}}}_{j}={\varvec{h}}_{j}{\varvec{\alpha }}. \end{aligned}$$
(15)
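Equations (12)–(15) can be traced with the following plain-Python sketch, which uses no external libraries. It assumes real-valued inputs, takes tanh as the activation \(\sigma \) (the text does not specify it) and uses hypothetical toy sizes and weights; in the actual network the weights are learned.

```python
import math

def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def elementwise(fn, A):
    return [[fn(x) for x in row] for row in A]

def column_softmax(S):
    """Softmax over the row index i for every column j, as in Eq. (14)."""
    cols = []
    for col in zip(*S):
        m = max(col)  # subtract the maximum for numerical stability
        e = [math.exp(x - m) for x in col]
        z = sum(e)
        cols.append([x / z for x in e])
    return transpose(cols)

def h0_attention(H, h0, W_q, W_f, act=math.tanh):
    q = elementwise(act, matmul(W_q, [h0]))  # (K/2 x 1)(1 x N_t), Eq. (12)
    f = elementwise(act, matmul(W_f, H))     # (K/2 x K)(K x N_t), Eq. (13)
    score = matmul(transpose(q), f)          # N_t x N_t
    alpha = column_softmax(score)            # Eq. (14)
    H_hat = matmul(H, alpha)                 # row j is h_j times alpha, Eq. (15)
    return H_hat, alpha

# Toy sizes (hypothetical): K = 4 users, N_t = 3 antennas, so K/2 = 2.
H = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.5, 0.5, 0.5], [2.0, 0.0, 1.0]]
h0 = [0.1, -0.2, 0.3]
W_q = [[0.5], [-0.5]]
W_f = [[0.1, 0.2, 0.0, -0.1], [0.0, -0.2, 0.3, 0.1]]
H_hat, alpha = h0_attention(H, h0, W_q, W_f)
```

Each column of `alpha` sums to one, so every entry of a reweighted row \({\hat{\varvec{h}}}_{j}\) is a convex combination of the entries of the original row \({\varvec{h}}_{j}\), with weights driven by \({\varvec{h}}_0\).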

Fig. 6 Comparison of fair and unfair antennas power allocation with different parameters

3.2.2 The direct channel \(h_{k}\) attention network

Equation (1) indicates that the spectral efficiency depends on \(h_{k,i}P_{k,i}^{1/2}\) and \(h_{l,i}P_{k,i}^{1/2}\). This means the channel gain relationship among users can influence the resulting \(S\!E\). Therefore, we design the direct channel \(h_{k}\) attention network. Note that because Networks 1, 2 and 3 have different functions, they do not share the same direct channel \(h_{k}\) attention network. In addition, for Network 1, using a CNN alone ignores the relationships between non-adjacent rows in \({\varvec{H}}\), so the direct channel \(h_{k}\) attention network can help the CNN to provide a better performance. The structure of the \(h_{k}\) attention network is similar to that of the \(h_{0}\) attention network. Note that the input of the network is \({\hat{{\varvec{H}}}}\) (the output of the \(h_0\) attention network) and that the matrix \({\hat{{\varvec{q}}}}\) is produced from \({\hat{{\varvec{H}}}}\) itself instead of \({\varvec{h}}_{0}\). In addition, we introduce a new matrix \({\hat{{\varvec{v}}}}\). The matrices \({\hat{{\varvec{q}}}}\in {\mathbb {R}}^{K \times N_t}\), \({\hat{{\varvec{f}}}}\in {\mathbb {R}}^{K \times N_t}\), \({\hat{{\varvec{v}}}}\in {\mathbb {R}}^{K \times N_t}\) can be described as follows:

$$\begin{aligned} \left\{ \begin{aligned} {\hat{{\varvec{q}}}}=\sigma ({\hat{{\varvec{W}}}}^{q}{\hat{{\varvec{H}}}}),&\\ {\hat{{\varvec{f}}}}=\sigma ({\hat{{\varvec{W}}}}^{f}{\hat{{\varvec{H}}}}),&\\ {\hat{{\varvec{v}}}}=\sigma ({\hat{{\varvec{W}}}}^{v}{\hat{{\varvec{H}}}}),&\\ \end{aligned} \right. \end{aligned}$$
(16)

where \({\hat{{\varvec{W}}}}^{q}\), \({\hat{{\varvec{W}}}}^{f}\), \({\hat{{\varvec{W}}}}^{v}\) are \(K\times K\) weight matrices. Similar to Equation (14), \(\beta _{i,j}\) can be obtained by the inner product of extracted vectors (16) as follows:

$$\begin{aligned} \beta _{i,j}\triangleq \frac{\exp \left( [{\hat{{\varvec{q}}}}]_i^T [{\hat{{\varvec{f}}}}]_j\right) }{\sum _{i=1}^{N_t} \exp \left( [{\hat{{\varvec{q}}}}]_i^T [{\hat{{\varvec{f}}}}]_j\right) }. \end{aligned}$$
(17)

where \([\hat{{\varvec{q}}}]_i\) and \([\hat{{\varvec{f}}}]_j\) are the ith and jth columns of \(\hat{{\varvec{q}}}\) and \(\hat{{\varvec{f}}}\), respectively. We define \({\varvec{{\tilde{H}}}}\triangleq \left[ {\varvec{{\tilde{h}}}}_{1}^T,{\varvec{{\tilde{h}}}}_{2}^T,\cdots , {\varvec{{\tilde{h}}}}_{j}^T,\cdots ,{\varvec{{\tilde{h}}}}_{K}^T\right] ^T\in {\mathbb {R}}^{K \times N_t}\) as the output of the direct channel \(h_{k}\) attention network, where

$$\begin{aligned} {\tilde{{\varvec{h}}}}_{j}= {\varvec{v}}^{\mathrm {T}}_{j}{\varvec{\beta }}. \end{aligned}$$
(18)
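A corresponding sketch of Eqs. (16)–(18) is given below in plain Python. It assumes real-valued inputs, reads Eq. (18) as the jth row of \({\hat{{\varvec{v}}}}\) multiplied by \({\varvec{\beta }}\) (so that the output is \(K\times N_t\)), and lets the caller choose the activation; these readings, the function names and the toy values are assumptions of this example.

```python
import math

def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def elementwise(fn, A):
    return [[fn(x) for x in row] for row in A]

def column_softmax(S):
    """Softmax over the row index i for every column j, as in Eq. (17)."""
    cols = []
    for col in zip(*S):
        m = max(col)  # subtract the maximum for numerical stability
        e = [math.exp(x - m) for x in col]
        z = sum(e)
        cols.append([x / z for x in e])
    return transpose(cols)

def hk_attention(H_hat, W_q, W_f, W_v, act=math.tanh):
    q = elementwise(act, matmul(W_q, H_hat))  # K x N_t, Eq. (16)
    f = elementwise(act, matmul(W_f, H_hat))  # K x N_t
    v = elementwise(act, matmul(W_v, H_hat))  # K x N_t
    # Entry (i, j) of q^T f is the inner product of column i of q and column j of f.
    beta = column_softmax(matmul(transpose(q), f))  # N_t x N_t, Eq. (17)
    H_tilde = matmul(v, beta)  # assumed reading of Eq. (18): row j of v times beta
    return H_tilde, beta

# Toy sizes (hypothetical): K = 2 users, N_t = 3 antennas.
H_hat = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
W_q = [[0.2, -0.1], [0.0, 0.3]]
W_f = [[0.1, 0.1], [-0.2, 0.4]]
W_v = [[1.0, 0.0], [0.0, 1.0]]
H_tilde, beta = hk_attention(H_hat, W_q, W_f, W_v)
```

Unlike the cross-channel case, the query here is derived from \({\hat{{\varvec{H}}}}\) itself, so the weights in `beta` reflect relationships inside the users' own channel matrix.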

4 Experiment

Table 2 Att-CNN parameters

We built a channel model based on [21] to produce channel gains, which takes path loss and multi-path fading effects into account. The following configuration is used. The path loss exponent is set to 2.5, and the distance between the CUs/PU and the CR/PR base stations is a uniformly distributed random variable in the range \(\left[ 10,200\right] \). Using this channel model, we produced a data set of 1000 \(10\times 100\) matrices for the following Sects. 4.1 and 4.2. We assumed \(N_t=99\) antennas and \(K=9\) CR users. In every \(10\times 100\) matrix, the \(h_{k,i}\) elements occupy the first 9 rows and 99 columns; the \(h_{0,i}\) elements occupy the last row (column indices 1 to 99); the \(g_k\) elements occupy the last column (row indices 1 to 9); and the element \(g_0\) occupies the last row and last column. 90% of the data set was used for training and 10% for testing. Such a training/testing split is widely used in the data mining and machine learning domains [22,23,24]. The noise was generated as a circularly symmetric complex Gaussian random variable with zero mean and unit variance. Table 2 lists the parameters of our Att-CNN. In addition, we set \(N_d\), the number of epochs, the batch size and the learning rate to 20, 100, 100 and 0.005, respectively. The following experiments were conducted on a PC with a 3.8 GHz AMD-R5-2600 CPU, a GeForce RTX 2060 with a 6 GB frame buffer and 16 GB RAM.
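The layout of one \(10\times 100\) sample can be sketched as follows: the \(h_{k,i}\) block fills the top-left \(9\times 99\) corner, \(h_{0,i}\) the last row, the nine \(g_k\) values the last column, and \(g_0\) the bottom-right corner. The helper names `pack_sample` and `unpack_sample` and the placeholder values are hypothetical; the channel model itself is not reproduced here.

```python
def pack_sample(h, h0, g, g0):
    """Arrange the channels into one (K+1) x (N_t+1) matrix."""
    M = [list(h[k]) + [g[k]] for k in range(len(h))]  # rows 1..K: h_k, then g_k
    M.append(list(h0) + [g0])                         # last row: h_0, then g_0
    return M

def unpack_sample(M):
    """Inverse of pack_sample."""
    h = [row[:-1] for row in M[:-1]]
    g = [row[-1] for row in M[:-1]]
    h0 = M[-1][:-1]
    g0 = M[-1][-1]
    return h, h0, g, g0

# Placeholder complex values that only encode their own position.
K, N_t = 9, 99
h = [[complex(k, i) for i in range(N_t)] for k in range(K)]
h0 = [complex(-1, i) for i in range(N_t)]
g = [complex(k, -1) for k in range(K)]
g0 = complex(-1, -1)
M = pack_sample(h, h0, g, g0)
```

Packing all channels of one realization into a single matrix lets the data initialization network consume a sample as one image-like input.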

4.1 Fair antenna power allocation

To validate our fair antenna power allocation method, we conducted experiments to compare the results of fair and unfair antenna power allocation. We assume that \(P_{t}=1e-2\) and \(\lambda _{k}=1e-4\), and \(I_{th}\) varies from \(1e-6\) to \(1e-8\). Figure 6 shows comparisons of fair and unfair power allocation over users and antenna elements. Three sub-plots are shown, each representing one comparison for a different configuration. In every sub-plot, the heat map of the fair scheme (left) is compared to that of the unfair scheme (right). The horizontal and vertical axes are the users and antennas, respectively. Each pixel represents the power allocated on one antenna to one CR user. The color depth represents the allocated power value: the closer the color is to dark blue, the larger the power. It is obvious that the fair model assigns power to all users without concentrating all power on particular indices, whereas the unfair model concentrates the power assignment on a single CU index.

Fig. 7 A fully connected neural network which only replaces the attention and convolutional parts with three ReLU layers in Att-CNN

Fig. 8 Training and testing results of the Att-CNN and FNN

Fig. 9 \(S\!E\) versus \(S\!N\!R_P\) when \(I_{th}=1e-6\), \(P_{P\!R}=1e-4\) and \(\sigma =1e-9\)

Fig. 10 \(S\!E\) versus \(I\!N\!R\) when \(P_t=1e-4\), \(P_{P\!R}=1e-4\) and \(\sigma =1e-9\)

Fig. 11 Complexity analysis for FNN and proposed Att-CNN

4.2 Power allocation performance

As a typical benchmark, we implemented an equal power method (EPM), in which \(P_{k,i}\) is calculated as \(P_{k,i} = P_t/(N_t K I_d)\). Note that the division by \(I_d\) is meant to meet the constraint C3 in Eq. (3). Then, we implemented an FNN to validate the benefit of the proposed attention-based CNN, i.e., Att-CNN, from the perspective of neural network structures. Figure 7 shows the structure of the FNN. The data initialization network and the results calculation network are kept the same as in the Att-CNN, and three ReLU layers with 891 neurons are used to replace the attention networks and convolutional layers.

To compare the performance of the antenna power allocation methods, we defined a signal-to-noise ratio and an interference-to-noise ratio as follows.

$$\begin{aligned} \left\{ \begin{array}{l} S\!N\!R_P\triangleq \dfrac{P_{t}}{\sigma ^2},\\ I\!N\!R\triangleq \dfrac{I_{th}}{\sigma ^2}. \end{array} \right. \end{aligned}$$
(19)

Then we conducted two experiments using the EPM, Att-CNN and FNN, where we changed the \(S\!N\!R_P\) and \(I\!N\!R\) from 0 to 50 dB to observe changes of SE. We assumed that \(\sigma ^2=1e-9\), \(P_{P\!R} = 1e-4\) and \(\lambda _k = 0\) in these two experiments.
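The sweep over \(S\!N\!R_P\) and \(I\!N\!R\) in Eq. (19) translates into values of \(P_t\) and \(I_{th}\) as sketched below. The 10 dB step and the helper names are assumptions of this example; only the 0 to 50 dB range and \(\sigma ^2=1e-9\) come from the text.

```python
import math

def db_to_linear(db):
    return 10.0 ** (db / 10.0)

def linear_to_db(x):
    return 10.0 * math.log10(x)

sigma2 = 1e-9                          # noise power used in both experiments
ratio_db_grid = list(range(0, 51, 10)) # 0, 10, ..., 50 dB (step is an assumption)

# SNR_P = P_t / sigma^2  ->  P_t  = SNR_P * sigma^2
P_t_grid = [db_to_linear(d) * sigma2 for d in ratio_db_grid]
# INR   = I_th / sigma^2 ->  I_th = INR * sigma^2
I_th_grid = [db_to_linear(d) * sigma2 for d in ratio_db_grid]
```

With this noise power, the upper end of the sweep (50 dB) corresponds to \(P_t=1e-4\), the value quoted for the \(I\!N\!R\) experiment.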

Figure 8 compares the training behavior of the Att-CNN and the FNN. After around 10 epochs, both methods achieve good results. The FNN overfits more severely than the Att-CNN, which may be caused by the larger number of parameters in the FNN together with the limited training sets. After we added the attention mechanism and the CNN into the network structure, the Att-CNN achieved a better \(S\!E\) and significantly reduced overfitting.

Figure 9 shows the \(S\!E\) against \(S\!N\!R_P\), when \(I_{th}=1e-6\). The gap between the Att-CNN and the FNN grows as \(S\!N\!R_P\) increases, up to 0.588 b/s/Hz. The EPM performs worst, falling below even the FNN by a gap of up to 0.934 b/s/Hz.

Figure 10 shows the \(S\!E\) versus \(I\!N\!R\), when \(P_{t}=1e-4\). It is clear that our proposed Att-CNN always outperforms the FNN and the EPM when \(I\!N\!R\) varies from 0 to 50 dB. The advantage of the Att-CNN over the FNN and the EPM is up to 0.596 and 1.586 b/s/Hz, respectively.

4.3 Computational performance

The proposed Att-CNN not only has better power allocation performance, but also has better computational performance than the FNN. We use floating point operations, i.e., FLOPs, and number of parameters to evaluate computational performance.

Figure 11a shows the FLOPs versus the number of users in the range between 2 and 9, with the number of antennas fixed to 99. The FNN requires fewer FLOPs than the Att-CNN in only one case, i.e., when the number of users is less than 3. When there are more than three users, the FLOPs of the FNN increase sharply, while those of the Att-CNN increase slowly. Figure 11b shows the FLOPs against the number of antennas with 9 users. The Att-CNN requires fewer FLOPs than the FNN when the number of antennas is more than 49, and its FLOPs also increase slowly. Note that we assume a massive MIMO system, hence the cases with more antennas are the more relevant ones.

Figure 11c, d shows the number of parameters against the number of users and the number of antennas, respectively. The trends are almost the same as in the FLOPs analysis above.
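The effect of parameter sharing on the parameter count can be illustrated with the standard counting formulas for a fully connected layer and a 2D convolutional layer. The 891-neuron width matches the FNN described above, whereas the 4*4 kernel (borrowed from the receptive-field example) and the 16 channels are hypothetical values, not the Att-CNN configuration of Table 2.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv2d_params(k_h, k_w, c_in, c_out):
    """Weights plus biases of a 2D convolutional layer (shared across positions)."""
    return k_h * k_w * c_in * c_out + c_out

# One 891 -> 891 fully connected layer, as in the three ReLU layers of the FNN.
fc_count = fc_params(891, 891)
# One convolutional layer with a 4x4 kernel and 16 input/output channels (hypothetical).
conv_count = conv2d_params(4, 4, 16, 16)
```

The fully connected count grows with the product of the layer widths, and hence with \(N_t K\), while the convolutional count is independent of the input size, which is consistent with the slower growth reported for the Att-CNN.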

5 Related work

5.1 Convolutional neural network

Deep learning has dramatically improved the state of the art in many fields, such as speech recognition, visual object recognition and object detection [25, 26]. Deep learning models can take different structures, e.g., the fully connected neural network (FNN), recurrent neural network (RNN) and convolutional neural network (CNN). Among them, CNNs have been widely studied in many fields.

Fukushima [27] carried out pioneering research on CNNs in 1980, proposing the neocognitron model, which included convolution and pooling layers. LeCun et al. [28] applied back propagation in their LeNet-5, which became the prototype of CNNs. In 2012, Hinton et al. [29] improved the performance of CNNs in image recognition with AlexNet, which used a deep structure and the dropout method. Building on this previous research, LeCun et al. [30] reduced the error rate to 11% with their DropConnect. Later on, the error rate decreased to 6.7% in the work of Yan et al. [31], who proposed a flexible CNN structure called Network in Network. Besides image classification, CNNs have been applied to object detection [32], fault prediction [33], natural language processing [34] and so on. However, we found no CNN research on antenna power allocation.

5.2 Attention mechanism

Sutskever et al. [35] proposed a sequence-to-sequence model in 2014, which has two problems: 1) when sentences are too long, the model performance decreases sharply; 2) different words in every sentence have the same priority. These problems are also mentioned in computer vision by Mnih et al. [36]. To address these issues, Bahdanau et al. [37] proposed the attention mechanism, implementing soft attention and providing visual experimental figures. Later on, Xu et al. [38] applied the attention mechanism to computer vision, where they proposed two types of attention mechanisms: soft and hard attention. In 2015, Luong et al. [39] developed these into two further types: local and global attention. In 2017, Ahmed et al. [40] proposed a new network structure, named transformer, which included a self-attention mechanism. The attention mechanisms have many applications. For instance, Li et al. [41] leveraged the attention mechanism to focus on the objects clicked by users in session-based recommender systems; Liu et al. [42] proposed a gated multilingual attention framework to address the issues of data sparsity and monolingual ambiguity in event detection tasks; Du et al. [43] proposed a new self-attention mechanism which leveraged the relationships between local features and applied it to industrial video data. Min et al. [44] proposed a bottom-up and top-down attention mechanism that can assign higher weights to sensitive features, thereby limiting redundant information. To the best of the authors’ knowledge, no prior work has introduced the attention mechanism in the radio resource management context, particularly for the fair power assignment task. The novel use of the attention mechanism in this work is motivated by its ability to let input features dynamically come to the forefront as needed [38]. This characteristic is inspired by the related literature above.

5.3 Resource allocation

From the resource allocation perspective, machine learning methods provide fast processing and can solve optimization problems for time-sensitive tasks. Those tasks are treated as a black box, given that the relationship between the input and output is learnable via a deep neural network [25]. Without loss of generality, deep neural networks can approximately solve non-convex optimization problems known to be NP-hard [45]. In the non-CR context, the authors of [46] developed an approximation of a weighted MMSE optimization by a deep neural network. The aim was to reduce the computational complexity and hence enable real-time processing at almost the same performance. In the CR context, Zhou et al. [47] implemented a deep neural network for resource allocation, namely spectral efficiency and energy efficiency maximization. The training data were obtained by specific conventional strategies from the literature. Such a dependency on conventional algorithms raises the computational complexity. Liu et al. [48] employed the weighted sum of the CR interference power as the objective of a minimization problem with quality of service constraints for both the CR and PR networks. Since the problem is non-convex, a message-passing algorithm based on deep learning was utilized. A spectral efficiency maximization problem for a set of transceiver pairs was solved by an FNN in [49]. Device-to-device communication links were considered, and the FNN was trained on a data set that was normalized for time efficiency. This model addresses neither infrastructure-based networks nor fairness constraints, which are vital requirements in real networks.

6 Conclusion

This paper built a mathematical model for the unfair antenna power allocation issue in MIMO systems. Then, an attention-based convolutional neural network, i.e., Att-CNN, was proposed to address the issue. There, an \(h_0\) attention network was used to reevaluate the weights of the channel gains in terms of \(h_0\); an \(h_k\) attention network was used to change the weights of the channel gains among users; and a CNN was applied to decrease the floating point operations (FLOPs) and the number of network parameters. We used heat maps to compare fair and unfair allocation under varying interference thresholds, which verified the fairness effect. To validate the power allocation performance of our Att-CNN, we compared it with equal power allocation and a fully connected neural network (FNN) in terms of spectral efficiency against \(S\!N\!R_P\) and \(I\!N\!R\). Finally, we analyzed the FLOPs and the number of parameters of the Att-CNN and the FNN, which indicated the superiority of our Att-CNN.

In future research, we intend to further improve the computational performance of our Att-CNN and apply it in realistic industrial systems.