1 Introduction

Past few decades have witnessed the explosive increase of digital devices and the exponential growth of data rate in modern communication systems. Thanks to predecessors’ unremitting efforts, the data rate of wireless communication systems has been drastically improved from 9.6 Kbps of the Global System for Mobile (GSM) for 2G to 1 Gbps of the 3rd Generation Partnership Project Long Term Evolution Advanced (3GPP LTE-A) for 4G [1,2,3]. Compared to 4G, 5G communications aim to achieve 1000 times system capacity, 10 to 100 times spectral efficiency, energy efficiency and data rate, and 25 times average cell throughput [4, 5].

In order to fulfil those challenging requirements, advanced techniques focusing on new air-interface, coding, modulation, waveform, and multiple-access have been proposed recently. Among those key techniques, massive multiuser multiple-input multiple-output (MU-MIMO) system has captured more attentions from and brought more imaginations to researchers and engineers [6, 7]. In a common assumption, massive MU-MIMO systems employ a few hundreds of antennas at base station (BS), simultaneously serving many tens of terminals via the same time-frequency resource. This emerging technique scales up MIMO by possibly orders of magnitude, and achieves significant advantages in increased data rate, enhanced reliability, improved energy efficiency, and reduced interference [8,9,10].

Unfortunately, the aforementioned benefits would not become apparent until the complexity and interference challenges brought by the system-scale growth have been well addressed. It has been proven in literature that the optimal MIMO detection is NP-hard [11, 12]. And the computational complexity of such optimal algorithms based on either maximum-likelihood (ML) criterion or maximum a posteriori (MAP) [13,14,15] criterion quickly becomes excessive as the number of decision variables increases. Though linear detections can achieve a good tradeoff between performance and complexity when the system-scale is small, such algorithms as zero-forcing (ZF) and minimum-mean square-error (MMSE) [16, 17] still suffer from the computation-hungry matrix inversions for massive MIMO scenarios.

Being smarter, one should resort to some recent proposed iterative linear detectors alternatively. While these detectors try to strike a lower complexity in the theoretical sense, the following issues remain unconsidered in the existing literature and hinder them from being adopted by practical systems.

  • Convergence Issue: In order to be suitable for practical 5G applications, an iterative linear detector should not only consider the ideal channel but also the scenarios where the numbers of transmitting and receiving antennas are comparable or the antenna correlations are not negligible. However, for those cases, the spectral condition number (SCN) of the regularized Gram matrix will increase, and highly affects the convergence rate as shown later. In extreme cases, even sufficiently large iteration number could not guarantee satisfactory performance.

  • Complexity Issue: The complexity issue comes from two aspects: i) To achieve comparable performance as conventional linear detectors like MMSE, the slowed convergence will result in more iterations, or in other words, higher complexity and latency. ii) To speedup the convergence, the commonly adopted precondition relies on the incomplete Cholesky decomposition [18]. But for massive MU-MIMO, its pre-processing for the threshold will easily leads to excessive complexity. In summary, both aspects will offset or even eliminate the complexity advantage of iterative linear detectors, and such a balance point in between is pretty difficult to figure out in general.

1.1 Existing Relevant Works

In this subsection, the existing relevant algorithms and implementations for massive MU-MIMO detection are reviewed.

1.1.1 Iterative Linear Detectors

For a large-scale MU-MIMO system, the linear detectors with exact matrix inversion tend to cost an unaffordable complexity as the user number grows. To bypass the exact matrix inversion, many works focused on the Neumann series expansion (NSE) for approximation purpose [19,20,21,22,23,24,25,26,27]. However, when the number of NSE terms is greater than 2, it again suffers from considerable complexity. Recently, iterative linear algorithms, such as conjugate gradient (CG) [28,29,30,31,32], Gauss-Seidel (GS) [33,34,35], successive over-relaxation (SOR) [36,37,38,39,40,41], and others [42,43,44,45] have been considered for massive MIMO detection for a better balance of complexity and performance. Constrained by the inherent properties, the convergence rate heavily depends on: i) the system loading factor α, and ii) the channel correlation coefficient ζ. First, α indicates that with a fixed number of antennas, the BS cannot serve a fast growing number of users simultaneously. It is a pity that the widely adopted antenna configuration in literature is 128 × 8 or 16 (BS v.s. users), and the corresponding detection methods failed to take those many-user cases into consideration. Second, for correlated channels with high ζ, performance of the SOA iterative linear detectors appear unstable. Third, the involved max-log-LLR computation related to either large-scale matrix inversion or iterative processing, will lead to high complexity and long latency. Therefore, an iterative linear detector which successfully combines stable convergence-rate and low complexity for a wide range of scenarios, is highly desired for 5G applications.

1.1.2 Precondition Processor

The conventional solution to improve the convergence rate is introducing preconditionor. The key design issue for the precondition processor is the sparse-matrix decomposition, which is complex and laborious. However, to the best knowledge of the authors, very few works have been reported for accelerating sparse-matrix decomposition. According to benchmark study, existing accelerations of matrix decomposition were achieved by exploring new memory-access methods [46,47,48,49]. However, those works mainly focused on conventional LU decomposition, which is not suitable for sparse matrix. Therefore, sparse-matrix decomposition with a minimum hardware cost is highly expected.

1.2 Contributions of This Paper

  • Algorithm Innovation: To this end, this paper first explores the convergence properties of iterative linear solvers. Based on those properties, an adaptive precondition scheme for general iterative linear detections in massive MU-MIMO uplink is proposed. This scheme ensures a fast convergence rate for common iterative linear detectors with specific consideration on complexity reduction. Adaptivity of the proposed precondition lies in the threshold selection, which is based on two critical parameters influencing the convergence rate:

    1. 1.

      User-to-BS antennas ratio (also referred to as the system loading factor) α.

    2. 2.

      Channel correlation coefficient ζ.

    This threshold assists us to accelerate the convergence with acceptable overhead complexity, and is valid not only for the steepest descent (SD) detection but also the general iterative linear methods and scenarios. The convergence proof of this preconditioned detection is given in Section 3.

  • Architecture Innovation: For the implementation issue, a novel implementation methodology for the adaptive preconditioned SD (APSD) detection is first proposed with iteration bound analysis, timing rescheduling, and architecture optimization. A novel incomplete decomposition for sparse matrix along with its hardware architecture is also proposed for complexity purpose. The proposed APSD detector architecture has been implemented with FPGA. The performance and complexity comparisons with the SOA detectors clearly demonstrate the advantages of our design. It should be noted that this implementation methodology is also applicable for other adaptive preconditioned detections.

1.3 Paper Outline

The remainder of this paper is organized as follows. Section 2 briefly describes massive MU-MIMO system model and MMSE detection scheme. Section 3 gives the preconditioned SD-based soft-output detection and its convergence proof. Hardware design for the proposed detection is presented in Section 5. FPGA implementation results and comparison with the SOA designs are given in Section 6. Finally, Section 7 concludes the entire paper.

1.4 Notations

In this paper, lowercase and upper bold face letters stand for column vectors and matrices, respectively. The operations (▪)T, \(\overline {(\centerdot )}\), (▪)H, and \(\mathbb {E}\{\centerdot \}\) denote the operations of transpose, complex conjugate, conjugate transpose, and expectation, respectively. The entry in the i-th row and j-th column of A is A (i,j). The vector a in each iteration j is a j . The A-norm of a vector ν with respect to the matrix A is defined as \(\|\mathbf {\nu }\|_{\mathbf {A}}\triangleq (\mathbf {A} \mathbf {\nu } \cdot \mathbf {\nu })^{1/2}\).

2 Linear Detection Model for Massive MU-MIMO Uplink

In this paper, to fully illustrate the advantages of the proposed scheme, its performance has been carefully evaluated for different channel scenarios of massive MU-MIMO uplink system. In this paper, a system with loading factor \(\alpha =\frac {U}{B}\) is assumed, where a BS equipped wtih B antennas serves U single-antenna users simultaneously via a flat-fading channel. Therefore, the received symbol vector \(\mathbf {y} \in \mathbb C^{B\times 1}\) at the BS can be modeled as:

$$ \mathbf{y} = \mathbf{H} \mathbf{s} + \mathbf{n}, $$
(1)

where \(\mathbf {s} \in \mathbb C^{U\times 1}\) denotes the collection of transmitted symbols at the users side and \(\mathbf {n} \in \mathbb C^{B\times 1}\) denotes the independent identically distributed (i.i.d.) additive white Gaussian noise (AWGN) with zero mean and variance N 0. \(\mathbf {H} \in \mathbb C^{B \times U}\) is the B × U i.i.d. flat-fading channel response matrix with zero mean and unit variance. Details of H will be discussed below.

2.1 Correlated Channel Model

For better evaluation, the antenna correlation has been taken into account in this paper. More specifically, we adopt the Kronecker model in [50], and H can therefore be expressed as \(\mathbf {H} = \mathbf {R}_{r}^{1/2}\mathbf {T} \mathbf {R}_{t}^{1/2}\), where \(\mathbf T \in \mathbb C^{B \times U}\) is the B × U i.i.d. channel matrix with zero mean and unit variance. β represents the power attenuation caused by the large-scale fading (path-loss and shadowing). \(\mathbf {R}_{r} \in \mathbb C^{B \times B}\) and \(\mathbf {R}_{t} \in \mathbb C^{U \times U}\) are the spatial correlation matrices at BS and user side, respectively. Their entries are given as follows:

$$ \mathbf{R}_{r} (i,k) = \left\{ \begin{aligned} &\left( \zeta_{r} e^{j\theta} \right)^{k-i}, & {i \leq k},\\ &\mathbf{R}^{*}_{r} (k,i), & {i>k}; \end{aligned} \right. $$
(2)
$$ \mathbf{R}_{t} (i,k) = \left\{ \begin{aligned} &\left( \zeta_{t} e^{j\theta} \right)^{k-i}, & {i \leq k},\\ &\mathbf{R}^{*}_{t} (k,i), & {i>k}. \end{aligned} \right. $$
(3)

R(i,k) denotes the entry in the i-th row and k-th column. \(\mathbf {R}_{r}^{*}\) and \(\mathbf {R}_{t}^{*}\) are the conjugate matrices of R r and R t , respectively. Furthermore, we assume each cell with a free space reference distance d 0 covers an area with a radius of r, in which the users are randomly distributed. The distance between the i-th specific transmitter-receiver pair is d i . The large-scale fading effect is jointly determined by the average path-loss \(\left (\frac {d_{0}}{d_{i}}\right )^{\gamma }\) and the log-normal shadowing \(10log_{10}\psi \sim \mathcal {N}(0,\sigma ^{2}_{\psi })\) in a form of power attenuation \(F(i,i)=\left (\frac {d_{0}}{d_{i}}\right )^{\gamma }\psi \). To fully evaluate different massive MIMO detectors, we consider four common scenarios:

  1. 1.

    Uncorrelated: When correlations at BS and user side are eliminated (i.e. both ζ r and ζ t are zero), \(\mathbf {R}_{r}^{1/2}\) and \(\mathbf {R}_{t}^{1/2}\) degrade to I B and I U , respectively. And for fair comparison with the benchmarks, the large-scale fading is ignored. In this case, matrix H is actually the ideal i.i.d. Rayleigh fading channel matrix.

  2. 2.

    BS Correlated: For single-antenna users, the correlation among users is negligible. But for BS, correlation may exist due to the large-scale antenna array. Thus, the channel matrix \(\mathbf {H} = \mathbf {R}_{r}^{1/2} \mathbf {T} \mathbf {F}_{t}^{1/2}\), where F t = d i a g(F t (1,1),F t (2,2),…,F t (U,U)). F t (i,i) is the power attenuation factor between the i-th transmission stream and the BS.

  3. 3.

    User Correlated: For multi-antenna users, if the distance between two BS antennas is large enough (lager than half-wavelength), correlation between BS antennas can be omitted. Then \(\mathbf {R}_{t}^{1/2}\) becomes diagonal matrix F r , and \(\mathbf {H}=\mathbf {F}^{1/2}_r\mathbf {T}\mathbf {R}_{t}^{1/2} \), where F r = d i a g(F r (1,1),F r (2,2),…,F r (B,B)). F r (i,i) represents the power attenuation between the i-th receiving antenna at the BS and the users.

  4. 4.

    Fully Correlated: In this case, correlations take place at both sides. The matrix \(\mathbf {H}=\mathbf {R}_{r}^{1/2} \mathbf {T} \mathbf {R}_{t}^{1/2}\), where \(\mathbf {R}_{r}^{1/2}\) and \(\mathbf {R}_{t}^{1/2}\) are given by Eqs. 2 and 3, respectively.

2.2 MMSE Equalization Scheme

At the BS side, to minimize the mean squared error (MSE) of s, the MMSE equalizer [51] is employed:

$$ \hat{\mathbf{s}}=\left( \mathbf{H}^{H}\mathbf{H}+N_{0}E_{s}^{-1}\mathbf{I}\right)^{-1}\mathbf{H}^{H}\mathbf{y}, $$
(4)

where I is the identity matrix with dimension U and \(E_s=\mathbb {E} [\|s_{n}\|^2]\) is the average transmission power. For convenience, we denote

$$ \left\{ \begin{array}{ll} \tilde{\mathbf{y}} &= \mathbf{H}^{H} \mathbf y,\\ \mathbf{A} &= \left( \mathbf{H}^{H}\mathbf{H}+N_{0}E_{s}^{-1}\mathbf{I}\right). \end{array} \right. $$
(5)

Then Eq. 4 can be simplified as

$$ \hat{\mathbf{s}} = \mathbf{A}^{-1} \tilde{\mathbf{y}}. $$
(6)

To exactly calculate \(\hat {\mathbf {s}}\) , we have to deal with the matrix inversion with complexity of \(\mathcal {O} (U^3)\). As U grows, the complexity may become prohibitive.

2.3 Basic Assumptions

The following assumptions are made throughout the rest of this paper. First, we assume the channel fading matrix H as well as the correlation coefficients ζ r and ζ t are perfectly known at the receiver. Second, half-rate convolutional code is employed to mitigate the channel fading. Last but not least, the per-bit LLR is directly computed after the channel equalization and the results are passed to the standard Viterbi decoder [52].

3 Adaptive Preconditional Iterative Linear Detection

In this section, we start with the convergence theory of canonical iterative methods regarding different channel conditions. Based on the theoretical analysis, an adaptive preconditioned method to improve the convergence is proposed. Then, a more efficient LLR computation method is introduced for further complexity reduction.

3.1 Convergence Analysis for Different Channel Conditions

The MMSE equalization in Eq. 6 can be rewritten as \(\mathbf {A} \hat {\mathbf {s}} = \tilde {\mathbf y}\), which is a typical linear system equation with a Hermitian positive definite (HPD) coefficient matrix A. Here, we focus on the convergence issue of iterative linear solvers. Specific concerns have been put on cases when the system loading factor α approaching 1 and the correlation is inevitable.

Definition 1

The SCN of a normal matrix A (with the respect to 2 norm) is denoted by:

$$ \kappa(\mathbf{A}) \triangleq \frac{\lambda_{\max}(\mathbf{A})}{\lambda_{\min}(\mathbf{A})}, $$
(7)

where λ max(A) and λ min(A) are the greatest and least eigenvalues of A, respectively.

Indicated by [53], the mathematical principle behind is that, for HPD problems the convergence rate of iterative linear algorithms is highly related to the SCN of matrix A. Usually, higher SCN leads to lower convergence rate. Take the commonly used steepest descent (SD) algorithm as an example, in its i-th iteration, the upper bound of estimation error is:

$$ \|\hat{\mathbf{s}} - \hat{\mathbf{s}}^{(j)} \|_{\mathbf{A}} \leq \frac{\kappa-1}{\kappa+ 1}\|\hat{\mathbf{s}} - \hat{\mathbf{s}}^{(j-1)}\|_{\mathbf{A}}. $$
(8)

Without loss of generality, SD algorithm is employed as a running example in the followingFootnote 1. Therefore, we aim to identify efficient approach to reduce the SCN of A for convergence consideration. The analysis in [55] implies two points:

  1. 1.

    [α issue]: For an uncorrelated Rayleigh fading environment, increasing the antenna number at the BS reduces the probability of experiencing an ill-conditioned channel (system with large SCN).

  2. 2.

    [ζ issue]: The presence of spatial correlation tends to increase the probability of system’s experiencing a highly ill-conditioned channel.

According to the aforementioned observations, we have the following conclusion:

Corollary 1

For massive MIMO channel H with loading factor α(H)and correlation degree ζ(H), the SCN of the regularized Gram matrix A , κ(A)is bounded by the relations: κ(A) ∝ α(H)and κ(A) ∝ ζ(H).

Proof

See Appendix A. □

Since the minimum SCN of a normal matrix is one, iterative linear algorithms can achieve a high convergence rate if A is an identity matrix. This observation inspires us to explore adaptive precondition to lower the range of A’s SCN within the given boundaries.

3.2 Adaptive Precondition to Accelerate Convergence

As illustrated above, higher loading factor α and channel correlation ζ introduce an ill-conditioned MMSE detection problem. To improve the convergence rate of iterative detection for this problem, an efficient precondition based on incomplete LDLT (ILDLT) decomposition is proposed to lower the SCN for the linear equation \(\mathbf {A} \hat {\mathbf {s}} = \tilde {\mathbf y}\). Details of this precondition algorithm are given in Algorithm 1 as follows:

figure h

It should be noted that if ij = 0, (▪)(i,j) = 0. Here, η is a function of ζ r , ζ t , and α. The explicit expression of η in this work is obtained from the analysis in Section 3.1 with a heuristic perspective:

$$ \eta= (1-\zeta_{r})(1-\zeta_{t})(1-\alpha)\mathbf{A}_{(i, i)}. $$
(9)

With L and D obtained from Algorithm 1, we modify the original linear equation \(\mathbf {A} \hat {\mathbf {s}} = \tilde {\mathbf y}\) into

$$\begin{array}{@{}rcl@{}} \mathbf{M}\mathbf A\hat{\mathbf{s}}=\mathbf{M}\bar{\mathbf{y}}. \end{array} $$
(10)

The new coefficient matrix M A is defined by Eq. 11. It is an approximation of identity matrix (i.e. its SCN is close to one).

$$\begin{array}{@{}rcl@{}} \mathbf{M}\mathbf{A}=\mathbf{L}^{-1}\mathbf{D}^{-1}(\mathbf{L}^{H})^{-1}\mathbf A. \end{array} $$
(11)

Numerical results in Fig. 1 show consistency with Corollary 1 and the efficiency of proposed adaptive precondition.

Figure 1
figure 1

An illustrative SCN comparison for the matrices A and M A. Here, SNR = 10 dB and B = 128. Thanks to the proposed adaptive precondition approach, comparing the circle markers to the triangle markers, it is evident that the SCN can be reduced by roughly 50% when the matrix A is preconditioned into M A in the linear equation as in Eq. 10 to improve convergence as will be discussed in Section 4.

3.3 Adaptive Preconditioned Iterative Linear Detection

Now, we combine the adaptive precondition approach with SD detection. The proposed APSD detection is shown in Algorithm 2 as follows.

figure i

Here, J is the maximum iteration number. Usually for J = 2 or 3, the estimated result will converge to the final result \(\hat {\mathbf {s}}\). Readers can refer to Appendix B for derivation of Algorithm 2.

3.4 Efficient Approximated LLR Calculation

For finite length and robustness considerations, the soft messages in the above algorithms are in the form of logarithm likelihood ratio (LLR). In this subsection, the efficient and hardware-friendly LLR calculation is proposed. In general, LLR for the b-th bit in the n-th symbol is given by:

$$ \delta_{\hat{s}_{n}, b}\triangleq\log\left( \frac{f(\mathbf{y}|q_{n,b}= 1)}{f(\mathbf{y}|q_{n,b}= 1)}\right). $$
(12)

According to [56], we have

$$ \mathbf{C}=\mathbf A^{-1}(\mathbf{H}^{H}\mathbf{H}). $$
(13)

Substitute (4) into (13), we have

$$ \mathbf{C}=\mathbf{I} - N_{0}E_{s}^{-1}\mathbf{A}^{-1}, $$
(14)

and

$$ \mathbf{C}_{(n,n)} = 1 - N_{0}E_{s}^{-1}\mathbf{A}_{(n,n)}^{-1}. $$
(15)

It has been shown that, M in Eq. 10 is the coarse approximation of A −1 and only the diagonal entries of A −1 are needed. With the diagonal matrix D obtained by Algorithm 1, we simplify (15) as:

$$ \mathbf{C}_{(n,n)} = 1 -\gamma\frac{1}{\mathbf{D}_{n}}, $$
(16)

where \(\gamma =N_0E_s^{-1}\). Assume code bits are equally likely and independently Gaussian distributed. The approximated expression for \(\delta _{\hat {s}_n, b}\) is further derived in Eq. 17, where |⋅| is the absolute value and \(\mathbb {Q}_{b = 0,1}\) stands for the collections of constellations when the b-th bit equals 0 or 1, respectively. The LLR values are finally passed to the Viterbi decoder to recover s without inverting A.

$$\begin{array}{@{}rcl@{}} \delta_{\hat{s}_{n}, b}\,=\,\frac{1}{(1-\gamma\frac{1}{\mathbf{D}_{n}})\gamma\frac{1}{\mathbf{D}_{n}}}\!\left[\underset{q^{0} \in \mathbb{Q}_{b = 0}} {\min}\left\vert \hat{s}_{n} \,-\, (1\,-\,\gamma\frac{1}{\mathbf{D}_{n}}) q^{0} \right\vert^{2}\right.\\\left.-\!\underset{q^{1} \in \mathbb{Q}_{b = 1}} {\min}\left\vert \hat{s}_{n} - (1-\gamma\frac{1}{\mathbf{D}_{n}}) q^{1} \right\vert^{2} \right]. \end{array} $$
(17)

4 Numerical Verification of Performance

Here, numerical results for different massive MU-MIMO scenarios are given to verify the advantages of the APSD detection with approximated LLRs over the SOA baselines. The free space reference distance d 0 = 1 covers an area with a radius of d, in which the users are randomly distributed. The overall large-scale fading effect is jointly determined by the average path-loss \(\left (\frac {d_{0}}{d}\right )^{\gamma }\) and the log-normal shadowing \(10log_{10}\psi \sim \mathcal {N}(0,\sigma ^2_{\psi })\) in a form of power attenuation \(\rho =\left (\frac {d_{0}}{d}\right )^{\gamma }\psi \). Furthermore, the antenna correlation has been taken into account in this paper. “Cholesky” is the traditional Cholesky decomposition scheme. “CG_Exact” and “CG_Approx” are the recently proposed CG-based approaches [28,29,30,31] with the exact coefficient C (n,n) and approximate C (n,n), respectively. “APSD_Exact” and “APSD_Approx” are the proposed APSD detection with exact coefficient C (n,n) and approximate C (n,n). The DVB [171oct, 133oct] rate-1/2 convolutional code, 64 quadrature amplitude modulation (QAM), and B = 128 are considered.

Figure 2 shows the BER performance of proposed APSD detection against aforementioned baselines: i) Fig. 2a focuses on the impact of correlation factor ζ, and ii) Fig. 2b considers the effect of loading factor α. iii) Fig. 2c compares the convergence rates of different schemes.

Figure 2
figure 2

BER performance comparison.

In Fig. 2a, BER performances under different correlation conditions are compared. Specifically, we have: i) User Correlated case (ζ t = 0.2,ζ r = 0), ii) BS Correlated case (ζ t = 0,ζ r = 0.4), and iii) Fully correlated case (ζ t = 0.2,ζ r = 0.4). Here we assume d 0 = 1, γ = 3.5, \(\sigma ^2_{\psi }= 5\) and d i is a random number in [1,100]. Though the Cholesky decomposition scheme (Cholesky) offers the best performance, its performance gain over the proposed scheme is negligible. Furthermore, Cholesky’s complexity is quite huge as shown later. Also, the proposed scheme easily outperforms the recently proposed CG-based algorithm (“APSD_Exact”) in all cases with gain varying from 0.5 dB to 1 dB at BER = 10−3. Now it is concluded that performance degradation resulting from the efficient LLR computation in Section 3.4 is marginal (e.g. less than 0.2 dB than Cholesky scheme). Finally, it is also obsereved that the proposed scheme is quite robust when the correlation factor ζ varies.

In Fig. 2b, uncorrelated scenarios with different loading factors α are considered. When α is small (i.e. α ∈ [0,0.1]), the performances of all the schemes are similar. On the other side of this plot, when α reaches its maximum value 1, both the CG-based algorithms (“CG_Exact” and “CG_Approx”) and the APSD ones (“APSD_Exact” and “APSD_Approx”) become worse. However, when α ∈ [0.2,0.7] the proposed scheme outperforms CG-based algorithms. It is concluded that compared with the SOA approaches the proposed scheme is more suitable for scenarios when the loading factor changes in a wide range.

In Fig. 2c, the convergence property of the proposed scheme are compared with baselines by changing the number of iterations J in uncorrelated scenario. Both “APSD_Approx” and the proposed scheme achieve satisfactory performance, and the gaps from the optimal approach (“Cholesky”) are less than 0.1 dB when J = 2 and 3 at BER = 10−3. This convergence advantage is important for practical applications, which cannot afford too many iterations due to the latency and complexity considerations. On the other hand, the CG-based algorithms (“CG_Exact” and “CG_Approx”) still suffer a obvious degradation compared to “Cholesky” at BER = 10−3 even when J = 3.

To sum up, the proposed APSD detection with approximate LLR computation shows good robustness against varying channel correlation and loading factor. Its convergence rate is also superior over the SOA approaches. For practical massive MU-MIMO scenarios, the proposed APSD detection offers a good option balancing performance and complexity.

5 Hardware Architecture Design

In this section, implementation methodology for adaptive preconditioned iterative linear MIMO detection is proposed. The hardware architecture is given with iteration bound analysis and resource optimization. FPGA implementation has also been done to demonstrated the feasibility of the proposed method. Comparison with the SOA design has shown its advantages and indicated its possible adoption by practical systems.

5.1 Data-Flow Graph Analysis

The straightforward hardware block diagram of Algorithm 2 is illustrated in Fig. 3. It is shown that there are a total of three units involved: i) Initialization Unit, ii) Iteration Unit, and iii) Output Unit.

Figure 3
figure 3

Block diagram of the proposed APSD detector.

For iteration bound analysis, the data-flow graph (DFG) is given in Fig. 4. The mapping relationship between DFG nodes and modules in Fig. 3 is in the legend part. According to Fig. 3, we have:

Figure 4
figure 4

DFG for the proposed APSD detector.

Remark 1

The iteration bound of equals (T 7 + T 5 + T 6). The critical path equals (T 1 + T 2 + T 3 + T 5 + T 8 + T 9). Here, T i is the processing time of Node i.

Proof

In Fig. 4, there are 3 loops, namely l 1, l 2 and l 3:

$$ \left\{ \begin{array}{ll} l_{1}&=\textrm{Node } 7 \rightarrow \textrm{Node } 7,\\ l_{2}&=\textrm{Node } 7 \rightarrow \textrm{Node } 6\rightarrow \textrm{Node } 7,\\ l_{3}&=\textrm{Node } 7 \rightarrow \textrm{Node } 5\rightarrow \textrm{Node } 6\rightarrow \textrm{Node } 7. \end{array} \right. $$
(18)

Therefore, the iteration bound is:

$$ T_{\infty}=\max\{T_{7},T_{7}+T_{6},T_{7}+T_{5}+T_{6}\}=T_{7}+T_{5}+T_{6}. $$
(19)

Also, the critical path is p:

$$ \begin{array}{ll} p&=\textrm{Node } 1\rightarrow \textrm{Node } 2\rightarrow \textrm{Node } 3\\ &\rightarrow \textrm{Node } 5\rightarrow \textrm{Node } 8\rightarrow \textrm{Node } 9. \end{array} $$
(20)

It should also be noted that p is also constrained by l 2 and l 3 due to the dependencies shown in Fig. 4. □

According to Remark 1, the optimized hardware architecture is derived in Fig. 5, where special attentions have been paid to both l 3 and p.

Figure 5
figure 5

Detailed hardware architecture of the proposed APSD detector.

5.2 Initialization Unit

In Initialization Unit, GRAM MATRIX module employs a triangular systolic array of (1 + U)U/2 processing elements (PEs) to compute the lower triangular part of A, namely T A . MATCHED FILTER module computes \(\tilde {\mathbf {y}} = \mathbf {H}^{H} \mathbf y\) with a similar PE. (See [34] for details.) GRAM MATRIX module’s results are pipelined column-by-column to CCS (compressed column storage) module below.

5.2.1 CCS Module

Having the i-th column of T A (T A (:,i)), U CCS modules are triggered in a pipelined sequence. To be more specific, the CCS module associated with the i-th column receives data with 2 × (Ui + 1) input ports (every (Ui + 1) ports are for the real part and imaginary part, respectively). It performs one port shift due to the fact that the output of GRAM MATRIX module is produced by each PE, clock after clock. Then the original data will be output to a comparator and finally stored by RAM in a standard CCS format [57]. This procedure performs the outer for loop of Algorithm 1 (Line 2). The blue box in Fig. 5 gives the architecture.

5.2.2 ILDLT PROC Module

With CCS-formatted T A C , ILDL T PROC module performs Algorithm 1 by a co ntrol unit. The other blocks are: a multiplication accumulator (MAC), a reciprocal unit, and an adder. Its operation flow is in Algorithm 3. The architecture is in the yellow box of Fig. 5.

figure j

5.3 Iteration Unit

Iteration Unit takes care of all the iterative computations. According to Section 5.1, the iteration bound is (T 7 + T 5 + T 6), which is the sum of the processing time of Node 7 (complex adder in Fig. 3), Node 5 (TRI-L SOLVER module), and Node 6 (ITERATION module). Therefore, the core task of this part lies in the design of TRI-L SOLVER module and ITERATION module.

5.3.1 TRI-L SOLVER Module

To solve \(\mathbf {z}^{(j)} =(\mathbf {L}\mathbf {D}\mathbf {L}^H)^{-1} \tilde {\mathbf {y}}^{(j)}\) in Algorithm 2, we split this equation into three equations:

$$ \mathbf{u}^{(j)}=\mathbf{L}^{-1} \tilde{\mathbf{y}}^{(j)}, \mathbf{v}^{(j)}=\mathbf{D}^{-1}\mathbf{u}^{(j)}, \mathbf{z}^{(j)}=\mathbf{L}^{-H}\mathbf{v}^{(j)}. $$
(21)

Since L and L H are both triangular matrices with all-one diagonal entries and D is a diagonal matrix, the computation consisting of complex multiplication, addition (subtraction), and scalar division can be implemented in the similar manner as in ILDL T PROC module. Since there is no schedule overlap between TRI-L SOLVER module and ILDL T PROC module (Fig. 5), folding technique can be also considered.

5.3.2 ITERATION Module

ITERATION module in Fig. 3 calculates Lines 5 to 8 in Algorithm 2. Key operations are matrix-vector multiplication, inner product, scalar-vector multiplication, and reciprocal, where multiplication and reciprocal can be implemented by MAC and LUT, respectively. Its architecture is shown in the green box of Fig. 5

5.4 Output Unit

In Output Unit, \(\hat {\mathbf {s}}^{(J)}\) is obtained via a MAC, by accumulating the products of TRI-L SOLVER module’s result and ITERATION module’s result for J iterations. Then the LLR is calculated according to [58].

5.5 Timing Analysis

The overall schedule of the proposed APSD detector is shown in Fig. 6. Same as [34], GRAM MATRIX module and MATCH FILTER module require (B + 2U − 1) clock cycles and (B + U − 1) clock cycles, respectively. CCS module spends one clock per entry since the first entry of current column is output by GRAM MATRIX module. Thus, it is active from the (B + 1)-th cycle to the (B + 2U)-th cycle.

Figure 6
figure 6

Overall schedule of the proposed APSD detector.

The timing analysis of I L D L T P R O C module is more complicated due to the variety of the sparsity structure of T A C in different scenarios. According to Algorithm 3, the maximum processing time of this unit is T s = (U + U 2)/2 (if no value is dropped from the original matrix). Then, ρ s (≪ 1) is introduced as an assistant parameter to specify the exact cycles spent by ILDL T PROC module. Similar analysis with ρ t (≪ 1) is conducted for TRI-L SOLVER module. If no value is dropped from the original matrix, the maximum processing time T t = (U 2U)/2 + (U − 1).

6 FPGA Implementation and Comparison

To demonstrate the advantages of the proposed APSD detector with the SOA designs, implementation and comparison have been given for 128 × 8 massive MU-MIMO system with Xilinx Virtex-7 XC7VX690T FPGA. 16-bit quantization is adopted for both input and output per complex-dimension. All multipliers are 32-bit. The LUT of reciprocal module has 256 addresses and 8-bit outputs. Performance of these quantization schemes has been verified by the “fixedAPSD_approx” curve in Fig. 2c. The fixed-point simulation result shows negligible loss, which demonstrates the efficiency of the proposed hardware design.

In Table 1, the proposed APSD design is compared with the CG-based method (CGLS) [31] and the approximated NSE method [25]. Same FPGA and antenna configuration (128 × 8) are considered. It is shown that the proposed design achieves higher throughput than the CGLS with a better performance in various scenarios. Though the approximated NSE method can achieve higher throughput, the proposed design has the highest hardware efficiency.

Table 1 FPGA implementation comparison.

It should be noted that, since all our benchmarks did not show their power information, power consumption is not compared in Table 1. More relevant comparison based on overall power consumption will be given in our future work. A similar issue is, since all benchmarks only consider the ideal case, in Table 1 we only list the result of Case 1 in Section 2.1 for fair comparison. In fact, for trade-offs for different correlation coefficients, only processing latency will vary in different cases, which will finally affect the detection throughput.

7 Conclusion

In this paper, adaptive preconditioned iterative linear detection is proposed for correlated massive MU-MIMO uplink. Convergence rate of the algorithm has been carefully studied to balance the complexity and performance. Efficient LLR computation method has also been proposed for further complexity reduction. Theoretical as well as numerical results have demonstrated the advantages of the proposed method over the SOA designs. Implementation design methodology is proposed with architectural and timing optimization. FPGA implementation results of APSD detector have verified its application feasibility for practical massive MU-MIMO systems.