Abstract
Being the enabling technique for 5G wireless communications, massive multiple-input multiple-output (MIMO) system can drastically increase the capacity efficiency. However, a few hundreds of antennas will inevitably introduce notable complexity and therefore hinders its direct adoption. Though the state-of-the-art (SOA) iterative methods such as conjugate gradient (CG) detection show complexity advantage over the conventional ones such as MMSE detection, their convergence rates slow down if the antenna configurations become more complicated. To this end, first this paper devotes itself in exploring the convergence properties of iterative linear solvers and then leverages the proposed adaptive precondition technique to improve the convergence rate. This adaptive precondition technique is general and has been incorporated with steepest descent (SD) detection as a show case. An approximated calculation for log-likelihood ratios (LLRs) is proposed for further complexity reduction. Analytical and numerical results have shown that with the same iteration number, the adaptive preconditioned SD (APSD) detector outperforms the CG one around 1 dB when BER = 10−3. Hardware architecture for the APSD detector is proposed based on iteration bound analysis and architectural optimization for the first time. Architectures for other adaptive preconditioned iterative linear detectors can be easily derived by following similar design flow. Compared with the SOA designs, FPGA implementations have verified the APSD detector’s advantage in balancing throughput and complexity, and guaranteed its application feasibility for 5G wireless.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Past few decades have witnessed the explosive increase of digital devices and the exponential growth of data rate in modern communication systems. Thanks to predecessors’ unremitting efforts, the data rate of wireless communication systems has been drastically improved from 9.6 Kbps of the Global System for Mobile (GSM) for 2G to 1 Gbps of the 3rd Generation Partnership Project Long Term Evolution Advanced (3GPP LTE-A) for 4G [1,2,3]. Compared to 4G, 5G communications aim to achieve 1000 times system capacity, 10 to 100 times spectral efficiency, energy efficiency and data rate, and 25 times average cell throughput [4, 5].
In order to fulfil those challenging requirements, advanced techniques focusing on new air-interface, coding, modulation, waveform, and multiple-access have been proposed recently. Among those key techniques, massive multiuser multiple-input multiple-output (MU-MIMO) system has captured more attentions from and brought more imaginations to researchers and engineers [6, 7]. In a common assumption, massive MU-MIMO systems employ a few hundreds of antennas at base station (BS), simultaneously serving many tens of terminals via the same time-frequency resource. This emerging technique scales up MIMO by possibly orders of magnitude, and achieves significant advantages in increased data rate, enhanced reliability, improved energy efficiency, and reduced interference [8,9,10].
Unfortunately, the aforementioned benefits would not become apparent until the complexity and interference challenges brought by the system-scale growth have been well addressed. It has been proven in literature that the optimal MIMO detection is NP-hard [11, 12]. And the computational complexity of such optimal algorithms based on either maximum-likelihood (ML) criterion or maximum a posteriori (MAP) [13,14,15] criterion quickly becomes excessive as the number of decision variables increases. Though linear detections can achieve a good tradeoff between performance and complexity when the system-scale is small, such algorithms as zero-forcing (ZF) and minimum-mean square-error (MMSE) [16, 17] still suffer from the computation-hungry matrix inversions for massive MIMO scenarios.
Being smarter, one should resort to some recent proposed iterative linear detectors alternatively. While these detectors try to strike a lower complexity in the theoretical sense, the following issues remain unconsidered in the existing literature and hinder them from being adopted by practical systems.
-
Convergence Issue: In order to be suitable for practical 5G applications, an iterative linear detector should not only consider the ideal channel but also the scenarios where the numbers of transmitting and receiving antennas are comparable or the antenna correlations are not negligible. However, for those cases, the spectral condition number (SCN) of the regularized Gram matrix will increase, and highly affects the convergence rate as shown later. In extreme cases, even sufficiently large iteration number could not guarantee satisfactory performance.
-
Complexity Issue: The complexity issue comes from two aspects: i) To achieve comparable performance as conventional linear detectors like MMSE, the slowed convergence will result in more iterations, or in other words, higher complexity and latency. ii) To speedup the convergence, the commonly adopted precondition relies on the incomplete Cholesky decomposition [18]. But for massive MU-MIMO, its pre-processing for the threshold will easily leads to excessive complexity. In summary, both aspects will offset or even eliminate the complexity advantage of iterative linear detectors, and such a balance point in between is pretty difficult to figure out in general.
1.1 Existing Relevant Works
In this subsection, the existing relevant algorithms and implementations for massive MU-MIMO detection are reviewed.
1.1.1 Iterative Linear Detectors
For a large-scale MU-MIMO system, the linear detectors with exact matrix inversion tend to cost an unaffordable complexity as the user number grows. To bypass the exact matrix inversion, many works focused on the Neumann series expansion (NSE) for approximation purpose [19,20,21,22,23,24,25,26,27]. However, when the number of NSE terms is greater than 2, it again suffers from considerable complexity. Recently, iterative linear algorithms, such as conjugate gradient (CG) [28,29,30,31,32], Gauss-Seidel (GS) [33,34,35], successive over-relaxation (SOR) [36,37,38,39,40,41], and others [42,43,44,45] have been considered for massive MIMO detection for a better balance of complexity and performance. Constrained by the inherent properties, the convergence rate heavily depends on: i) the system loading factor α, and ii) the channel correlation coefficient ζ. First, α indicates that with a fixed number of antennas, the BS cannot serve a fast growing number of users simultaneously. It is a pity that the widely adopted antenna configuration in literature is 128 × 8 or 16 (BS v.s. users), and the corresponding detection methods failed to take those many-user cases into consideration. Second, for correlated channels with high ζ, performance of the SOA iterative linear detectors appear unstable. Third, the involved max-log-LLR computation related to either large-scale matrix inversion or iterative processing, will lead to high complexity and long latency. Therefore, an iterative linear detector which successfully combines stable convergence-rate and low complexity for a wide range of scenarios, is highly desired for 5G applications.
1.1.2 Precondition Processor
The conventional solution to improve the convergence rate is introducing preconditionor. The key design issue for the precondition processor is the sparse-matrix decomposition, which is complex and laborious. However, to the best knowledge of the authors, very few works have been reported for accelerating sparse-matrix decomposition. According to benchmark study, existing accelerations of matrix decomposition were achieved by exploring new memory-access methods [46,47,48,49]. However, those works mainly focused on conventional LU decomposition, which is not suitable for sparse matrix. Therefore, sparse-matrix decomposition with a minimum hardware cost is highly expected.
1.2 Contributions of This Paper
-
Algorithm Innovation: To this end, this paper first explores the convergence properties of iterative linear solvers. Based on those properties, an adaptive precondition scheme for general iterative linear detections in massive MU-MIMO uplink is proposed. This scheme ensures a fast convergence rate for common iterative linear detectors with specific consideration on complexity reduction. Adaptivity of the proposed precondition lies in the threshold selection, which is based on two critical parameters influencing the convergence rate:
-
1.
User-to-BS antennas ratio (also referred to as the system loading factor) α.
-
2.
Channel correlation coefficient ζ.
This threshold assists us to accelerate the convergence with acceptable overhead complexity, and is valid not only for the steepest descent (SD) detection but also the general iterative linear methods and scenarios. The convergence proof of this preconditioned detection is given in Section 3.
-
1.
-
Architecture Innovation: For the implementation issue, a novel implementation methodology for the adaptive preconditioned SD (APSD) detection is first proposed with iteration bound analysis, timing rescheduling, and architecture optimization. A novel incomplete decomposition for sparse matrix along with its hardware architecture is also proposed for complexity purpose. The proposed APSD detector architecture has been implemented with FPGA. The performance and complexity comparisons with the SOA detectors clearly demonstrate the advantages of our design. It should be noted that this implementation methodology is also applicable for other adaptive preconditioned detections.
1.3 Paper Outline
The remainder of this paper is organized as follows. Section 2 briefly describes massive MU-MIMO system model and MMSE detection scheme. Section 3 gives the preconditioned SD-based soft-output detection and its convergence proof. Hardware design for the proposed detection is presented in Section 5. FPGA implementation results and comparison with the SOA designs are given in Section 6. Finally, Section 7 concludes the entire paper.
1.4 Notations
In this paper, lowercase and upper bold face letters stand for column vectors and matrices, respectively. The operations (▪)T, \(\overline {(\centerdot )}\), (▪)H, and \(\mathbb {E}\{\centerdot \}\) denote the operations of transpose, complex conjugate, conjugate transpose, and expectation, respectively. The entry in the i-th row and j-th column of A is A (i,j). The vector a in each iteration j is a j . The A-norm of a vector ν with respect to the matrix A is defined as \(\|\mathbf {\nu }\|_{\mathbf {A}}\triangleq (\mathbf {A} \mathbf {\nu } \cdot \mathbf {\nu })^{1/2}\).
2 Linear Detection Model for Massive MU-MIMO Uplink
In this paper, to fully illustrate the advantages of the proposed scheme, its performance has been carefully evaluated for different channel scenarios of massive MU-MIMO uplink system. In this paper, a system with loading factor \(\alpha =\frac {U}{B}\) is assumed, where a BS equipped wtih B antennas serves U single-antenna users simultaneously via a flat-fading channel. Therefore, the received symbol vector \(\mathbf {y} \in \mathbb C^{B\times 1}\) at the BS can be modeled as:
where \(\mathbf {s} \in \mathbb C^{U\times 1}\) denotes the collection of transmitted symbols at the users side and \(\mathbf {n} \in \mathbb C^{B\times 1}\) denotes the independent identically distributed (i.i.d.) additive white Gaussian noise (AWGN) with zero mean and variance N 0. \(\mathbf {H} \in \mathbb C^{B \times U}\) is the B × U i.i.d. flat-fading channel response matrix with zero mean and unit variance. Details of H will be discussed below.
2.1 Correlated Channel Model
For better evaluation, the antenna correlation has been taken into account in this paper. More specifically, we adopt the Kronecker model in [50], and H can therefore be expressed as \(\mathbf {H} = \mathbf {R}_{r}^{1/2}\mathbf {T} \mathbf {R}_{t}^{1/2}\), where \(\mathbf T \in \mathbb C^{B \times U}\) is the B × U i.i.d. channel matrix with zero mean and unit variance. β represents the power attenuation caused by the large-scale fading (path-loss and shadowing). \(\mathbf {R}_{r} \in \mathbb C^{B \times B}\) and \(\mathbf {R}_{t} \in \mathbb C^{U \times U}\) are the spatial correlation matrices at BS and user side, respectively. Their entries are given as follows:
R(i,k) denotes the entry in the i-th row and k-th column. \(\mathbf {R}_{r}^{*}\) and \(\mathbf {R}_{t}^{*}\) are the conjugate matrices of R r and R t , respectively. Furthermore, we assume each cell with a free space reference distance d 0 covers an area with a radius of r, in which the users are randomly distributed. The distance between the i-th specific transmitter-receiver pair is d i . The large-scale fading effect is jointly determined by the average path-loss \(\left (\frac {d_{0}}{d_{i}}\right )^{\gamma }\) and the log-normal shadowing \(10log_{10}\psi \sim \mathcal {N}(0,\sigma ^{2}_{\psi })\) in a form of power attenuation \(F(i,i)=\left (\frac {d_{0}}{d_{i}}\right )^{\gamma }\psi \). To fully evaluate different massive MIMO detectors, we consider four common scenarios:
-
1.
Uncorrelated: When correlations at BS and user side are eliminated (i.e. both ζ r and ζ t are zero), \(\mathbf {R}_{r}^{1/2}\) and \(\mathbf {R}_{t}^{1/2}\) degrade to I B and I U , respectively. And for fair comparison with the benchmarks, the large-scale fading is ignored. In this case, matrix H is actually the ideal i.i.d. Rayleigh fading channel matrix.
-
2.
BS Correlated: For single-antenna users, the correlation among users is negligible. But for BS, correlation may exist due to the large-scale antenna array. Thus, the channel matrix \(\mathbf {H} = \mathbf {R}_{r}^{1/2} \mathbf {T} \mathbf {F}_{t}^{1/2}\), where F t = d i a g(F t (1,1),F t (2,2),…,F t (U,U)). F t (i,i) is the power attenuation factor between the i-th transmission stream and the BS.
-
3.
User Correlated: For multi-antenna users, if the distance between two BS antennas is large enough (lager than half-wavelength), correlation between BS antennas can be omitted. Then \(\mathbf {R}_{t}^{1/2}\) becomes diagonal matrix F r , and \(\mathbf {H}=\mathbf {F}^{1/2}_r\mathbf {T}\mathbf {R}_{t}^{1/2} \), where F r = d i a g(F r (1,1),F r (2,2),…,F r (B,B)). F r (i,i) represents the power attenuation between the i-th receiving antenna at the BS and the users.
-
4.
Fully Correlated: In this case, correlations take place at both sides. The matrix \(\mathbf {H}=\mathbf {R}_{r}^{1/2} \mathbf {T} \mathbf {R}_{t}^{1/2}\), where \(\mathbf {R}_{r}^{1/2}\) and \(\mathbf {R}_{t}^{1/2}\) are given by Eqs. 2 and 3, respectively.
2.2 MMSE Equalization Scheme
At the BS side, to minimize the mean squared error (MSE) of s, the MMSE equalizer [51] is employed:
where I is the identity matrix with dimension U and \(E_s=\mathbb {E} [\|s_{n}\|^2]\) is the average transmission power. For convenience, we denote
Then Eq. 4 can be simplified as
To exactly calculate \(\hat {\mathbf {s}}\) , we have to deal with the matrix inversion with complexity of \(\mathcal {O} (U^3)\). As U grows, the complexity may become prohibitive.
2.3 Basic Assumptions
The following assumptions are made throughout the rest of this paper. First, we assume the channel fading matrix H as well as the correlation coefficients ζ r and ζ t are perfectly known at the receiver. Second, half-rate convolutional code is employed to mitigate the channel fading. Last but not least, the per-bit LLR is directly computed after the channel equalization and the results are passed to the standard Viterbi decoder [52].
3 Adaptive Preconditional Iterative Linear Detection
In this section, we start with the convergence theory of canonical iterative methods regarding different channel conditions. Based on the theoretical analysis, an adaptive preconditioned method to improve the convergence is proposed. Then, a more efficient LLR computation method is introduced for further complexity reduction.
3.1 Convergence Analysis for Different Channel Conditions
The MMSE equalization in Eq. 6 can be rewritten as \(\mathbf {A} \hat {\mathbf {s}} = \tilde {\mathbf y}\), which is a typical linear system equation with a Hermitian positive definite (HPD) coefficient matrix A. Here, we focus on the convergence issue of iterative linear solvers. Specific concerns have been put on cases when the system loading factor α approaching 1 and the correlation is inevitable.
Definition 1
The SCN of a normal matrix A (with the respect to ℓ 2 norm) is denoted by:
where λ max(A) and λ min(A) are the greatest and least eigenvalues of A, respectively.
Indicated by [53], the mathematical principle behind is that, for HPD problems the convergence rate of iterative linear algorithms is highly related to the SCN of matrix A. Usually, higher SCN leads to lower convergence rate. Take the commonly used steepest descent (SD) algorithm as an example, in its i-th iteration, the upper bound of estimation error is:
Without loss of generality, SD algorithm is employed as a running example in the followingFootnote 1. Therefore, we aim to identify efficient approach to reduce the SCN of A for convergence consideration. The analysis in [55] implies two points:
-
1.
[α issue]: For an uncorrelated Rayleigh fading environment, increasing the antenna number at the BS reduces the probability of experiencing an ill-conditioned channel (system with large SCN).
-
2.
[ζ issue]: The presence of spatial correlation tends to increase the probability of system’s experiencing a highly ill-conditioned channel.
According to the aforementioned observations, we have the following conclusion:
Corollary 1
For massive MIMO channel H with loading factor α(H)and correlation degree ζ(H), the SCN of the regularized Gram matrix A , κ(A)is bounded by the relations: κ(A) ∝ α(H)and κ(A) ∝ ζ(H).
Proof
See Appendix A. □
Since the minimum SCN of a normal matrix is one, iterative linear algorithms can achieve a high convergence rate if A is an identity matrix. This observation inspires us to explore adaptive precondition to lower the range of A’s SCN within the given boundaries.
3.2 Adaptive Precondition to Accelerate Convergence
As illustrated above, higher loading factor α and channel correlation ζ introduce an ill-conditioned MMSE detection problem. To improve the convergence rate of iterative detection for this problem, an efficient precondition based on incomplete LDLT (ILDLT) decomposition is proposed to lower the SCN for the linear equation \(\mathbf {A} \hat {\mathbf {s}} = \tilde {\mathbf y}\). Details of this precondition algorithm are given in Algorithm 1 as follows:
It should be noted that if i ⋅ j = 0, (▪)(i,j) = 0. Here, η is a function of ζ r , ζ t , and α. The explicit expression of η in this work is obtained from the analysis in Section 3.1 with a heuristic perspective:
With L and D obtained from Algorithm 1, we modify the original linear equation \(\mathbf {A} \hat {\mathbf {s}} = \tilde {\mathbf y}\) into
The new coefficient matrix M A is defined by Eq. 11. It is an approximation of identity matrix (i.e. its SCN is close to one).
Numerical results in Fig. 1 show consistency with Corollary 1 and the efficiency of proposed adaptive precondition.
3.3 Adaptive Preconditioned Iterative Linear Detection
Now, we combine the adaptive precondition approach with SD detection. The proposed APSD detection is shown in Algorithm 2 as follows.
Here, J is the maximum iteration number. Usually for J = 2 or 3, the estimated result will converge to the final result \(\hat {\mathbf {s}}\). Readers can refer to Appendix B for derivation of Algorithm 2.
3.4 Efficient Approximated LLR Calculation
For finite length and robustness considerations, the soft messages in the above algorithms are in the form of logarithm likelihood ratio (LLR). In this subsection, the efficient and hardware-friendly LLR calculation is proposed. In general, LLR for the b-th bit in the n-th symbol is given by:
According to [56], we have
Substitute (4) into (13), we have
and
It has been shown that, M in Eq. 10 is the coarse approximation of A −1 and only the diagonal entries of A −1 are needed. With the diagonal matrix D obtained by Algorithm 1, we simplify (15) as:
where \(\gamma =N_0E_s^{-1}\). Assume code bits are equally likely and independently Gaussian distributed. The approximated expression for \(\delta _{\hat {s}_n, b}\) is further derived in Eq. 17, where |⋅| is the absolute value and \(\mathbb {Q}_{b = 0,1}\) stands for the collections of constellations when the b-th bit equals 0 or 1, respectively. The LLR values are finally passed to the Viterbi decoder to recover s without inverting A.
4 Numerical Verification of Performance
Here, numerical results for different massive MU-MIMO scenarios are given to verify the advantages of the APSD detection with approximated LLRs over the SOA baselines. The free space reference distance d 0 = 1 covers an area with a radius of d, in which the users are randomly distributed. The overall large-scale fading effect is jointly determined by the average path-loss \(\left (\frac {d_{0}}{d}\right )^{\gamma }\) and the log-normal shadowing \(10log_{10}\psi \sim \mathcal {N}(0,\sigma ^2_{\psi })\) in a form of power attenuation \(\rho =\left (\frac {d_{0}}{d}\right )^{\gamma }\psi \). Furthermore, the antenna correlation has been taken into account in this paper. “Cholesky” is the traditional Cholesky decomposition scheme. “CG_Exact” and “CG_Approx” are the recently proposed CG-based approaches [28,29,30,31] with the exact coefficient C (n,n) and approximate C (n,n), respectively. “APSD_Exact” and “APSD_Approx” are the proposed APSD detection with exact coefficient C (n,n) and approximate C (n,n). The DVB [171oct, 133oct] rate-1/2 convolutional code, 64 quadrature amplitude modulation (QAM), and B = 128 are considered.
Figure 2 shows the BER performance of proposed APSD detection against aforementioned baselines: i) Fig. 2a focuses on the impact of correlation factor ζ, and ii) Fig. 2b considers the effect of loading factor α. iii) Fig. 2c compares the convergence rates of different schemes.
In Fig. 2a, BER performances under different correlation conditions are compared. Specifically, we have: i) User Correlated case (ζ t = 0.2,ζ r = 0), ii) BS Correlated case (ζ t = 0,ζ r = 0.4), and iii) Fully correlated case (ζ t = 0.2,ζ r = 0.4). Here we assume d 0 = 1, γ = 3.5, \(\sigma ^2_{\psi }= 5\) and d i is a random number in [1,100]. Though the Cholesky decomposition scheme (Cholesky) offers the best performance, its performance gain over the proposed scheme is negligible. Furthermore, Cholesky’s complexity is quite huge as shown later. Also, the proposed scheme easily outperforms the recently proposed CG-based algorithm (“APSD_Exact”) in all cases with gain varying from 0.5 dB to 1 dB at BER = 10−3. Now it is concluded that performance degradation resulting from the efficient LLR computation in Section 3.4 is marginal (e.g. less than 0.2 dB than Cholesky scheme). Finally, it is also obsereved that the proposed scheme is quite robust when the correlation factor ζ varies.
In Fig. 2b, uncorrelated scenarios with different loading factors α are considered. When α is small (i.e. α ∈ [0,0.1]), the performances of all the schemes are similar. On the other side of this plot, when α reaches its maximum value 1, both the CG-based algorithms (“CG_Exact” and “CG_Approx”) and the APSD ones (“APSD_Exact” and “APSD_Approx”) become worse. However, when α ∈ [0.2,0.7] the proposed scheme outperforms CG-based algorithms. It is concluded that compared with the SOA approaches the proposed scheme is more suitable for scenarios when the loading factor changes in a wide range.
In Fig. 2c, the convergence property of the proposed scheme are compared with baselines by changing the number of iterations J in uncorrelated scenario. Both “APSD_Approx” and the proposed scheme achieve satisfactory performance, and the gaps from the optimal approach (“Cholesky”) are less than 0.1 dB when J = 2 and 3 at BER = 10−3. This convergence advantage is important for practical applications, which cannot afford too many iterations due to the latency and complexity considerations. On the other hand, the CG-based algorithms (“CG_Exact” and “CG_Approx”) still suffer a obvious degradation compared to “Cholesky” at BER = 10−3 even when J = 3.
To sum up, the proposed APSD detection with approximate LLR computation shows good robustness against varying channel correlation and loading factor. Its convergence rate is also superior over the SOA approaches. For practical massive MU-MIMO scenarios, the proposed APSD detection offers a good option balancing performance and complexity.
5 Hardware Architecture Design
In this section, implementation methodology for adaptive preconditioned iterative linear MIMO detection is proposed. The hardware architecture is given with iteration bound analysis and resource optimization. FPGA implementation has also been done to demonstrated the feasibility of the proposed method. Comparison with the SOA design has shown its advantages and indicated its possible adoption by practical systems.
5.1 Data-Flow Graph Analysis
The straightforward hardware block diagram of Algorithm 2 is illustrated in Fig. 3. It is shown that there are a total of three units involved: i) Initialization Unit, ii) Iteration Unit, and iii) Output Unit.
For iteration bound analysis, the data-flow graph (DFG) is given in Fig. 4. The mapping relationship between DFG nodes and modules in Fig. 3 is in the legend part. According to Fig. 3, we have:
Remark 1
The iteration bound of equals (T 7 + T 5 + T 6). The critical path equals (T 1 + T 2 + T 3 + T 5 + T 8 + T 9). Here, T i is the processing time of Node i.
Proof
In Fig. 4, there are 3 loops, namely l 1, l 2 and l 3:
Therefore, the iteration bound is:
Also, the critical path is p:
It should also be noted that p is also constrained by l 2 and l 3 due to the dependencies shown in Fig. 4. □
According to Remark 1, the optimized hardware architecture is derived in Fig. 5, where special attentions have been paid to both l 3 and p.
5.2 Initialization Unit
In Initialization Unit, GRAM MATRIX module employs a triangular systolic array of (1 + U)U/2 processing elements (PEs) to compute the lower triangular part of A, namely T A . MATCHED FILTER module computes \(\tilde {\mathbf {y}} = \mathbf {H}^{H} \mathbf y\) with a similar PE. (See [34] for details.) GRAM MATRIX module’s results are pipelined column-by-column to CCS (compressed column storage) module below.
5.2.1 CCS Module
Having the i-th column of T A (T A (:,i)), U CCS modules are triggered in a pipelined sequence. To be more specific, the CCS module associated with the i-th column receives data with 2 × (U − i + 1) input ports (every (U − i + 1) ports are for the real part and imaginary part, respectively). It performs one port shift due to the fact that the output of GRAM MATRIX module is produced by each PE, clock after clock. Then the original data will be output to a comparator and finally stored by RAM in a standard CCS format [57]. This procedure performs the outer for loop of Algorithm 1 (Line 2). The blue box in Fig. 5 gives the architecture.
5.2.2 ILDLT PROC Module
With CCS-formatted T A C , ILDL T PROC module performs Algorithm 1 by a co ntrol unit. The other blocks are: a multiplication accumulator (MAC), a reciprocal unit, and an adder. Its operation flow is in Algorithm 3. The architecture is in the yellow box of Fig. 5.
5.3 Iteration Unit
Iteration Unit takes care of all the iterative computations. According to Section 5.1, the iteration bound is (T 7 + T 5 + T 6), which is the sum of the processing time of Node 7 (complex adder in Fig. 3), Node 5 (TRI-L SOLVER module), and Node 6 (ITERATION module). Therefore, the core task of this part lies in the design of TRI-L SOLVER module and ITERATION module.
5.3.1 TRI-L SOLVER Module
To solve \(\mathbf {z}^{(j)} =(\mathbf {L}\mathbf {D}\mathbf {L}^H)^{-1} \tilde {\mathbf {y}}^{(j)}\) in Algorithm 2, we split this equation into three equations:
Since L and L H are both triangular matrices with all-one diagonal entries and D is a diagonal matrix, the computation consisting of complex multiplication, addition (subtraction), and scalar division can be implemented in the similar manner as in ILDL T PROC module. Since there is no schedule overlap between TRI-L SOLVER module and ILDL T PROC module (Fig. 5), folding technique can be also considered.
5.3.2 ITERATION Module
ITERATION module in Fig. 3 calculates Lines 5 to 8 in Algorithm 2. Key operations are matrix-vector multiplication, inner product, scalar-vector multiplication, and reciprocal, where multiplication and reciprocal can be implemented by MAC and LUT, respectively. Its architecture is shown in the green box of Fig. 5
5.4 Output Unit
In Output Unit, \(\hat {\mathbf {s}}^{(J)}\) is obtained via a MAC, by accumulating the products of TRI-L SOLVER module’s result and ITERATION module’s result for J iterations. Then the LLR is calculated according to [58].
5.5 Timing Analysis
The overall schedule of the proposed APSD detector is shown in Fig. 6. Same as [34], GRAM MATRIX module and MATCH FILTER module require (B + 2U − 1) clock cycles and (B + U − 1) clock cycles, respectively. CCS module spends one clock per entry since the first entry of current column is output by GRAM MATRIX module. Thus, it is active from the (B + 1)-th cycle to the (B + 2U)-th cycle.
The timing analysis of I L D L T P R O C module is more complicated due to the variety of the sparsity structure of T A C in different scenarios. According to Algorithm 3, the maximum processing time of this unit is T s = (U + U 2)/2 (if no value is dropped from the original matrix). Then, ρ s (≪ 1) is introduced as an assistant parameter to specify the exact cycles spent by ILDL T PROC module. Similar analysis with ρ t (≪ 1) is conducted for TRI-L SOLVER module. If no value is dropped from the original matrix, the maximum processing time T t = (U 2 − U)/2 + (U − 1).
6 FPGA Implementation and Comparison
To demonstrate the advantages of the proposed APSD detector with the SOA designs, implementation and comparison have been given for 128 × 8 massive MU-MIMO system with Xilinx Virtex-7 XC7VX690T FPGA. 16-bit quantization is adopted for both input and output per complex-dimension. All multipliers are 32-bit. The LUT of reciprocal module has 256 addresses and 8-bit outputs. Performance of these quantization schemes has been verified by the “fixedAPSD_approx” curve in Fig. 2c. The fixed-point simulation result shows negligible loss, which demonstrates the efficiency of the proposed hardware design.
In Table 1, the proposed APSD design is compared with the CG-based method (CGLS) [31] and the approximated NSE method [25]. Same FPGA and antenna configuration (128 × 8) are considered. It is shown that the proposed design achieves higher throughput than the CGLS with a better performance in various scenarios. Though the approximated NSE method can achieve higher throughput, the proposed design has the highest hardware efficiency.
It should be noted that, since all our benchmarks did not show their power information, power consumption is not compared in Table 1. More relevant comparison based on overall power consumption will be given in our future work. A similar issue is, since all benchmarks only consider the ideal case, in Table 1 we only list the result of Case 1 in Section 2.1 for fair comparison. In fact, for trade-offs for different correlation coefficients, only processing latency will vary in different cases, which will finally affect the detection throughput.
7 Conclusion
In this paper, adaptive preconditioned iterative linear detection is proposed for correlated massive MU-MIMO uplink. Convergence rate of the algorithm has been carefully studied to balance the complexity and performance. Efficient LLR computation method has also been proposed for further complexity reduction. Theoretical as well as numerical results have demonstrated the advantages of the proposed method over the SOA designs. Implementation design methodology is proposed with architectural and timing optimization. FPGA implementation results of APSD detector have verified its application feasibility for practical massive MU-MIMO systems.
Notes
In fact, convergence rate of many canonical iterative linear methods, such as Jacobi, GS, SOR, and CG, heavily relies on the SCN [54].
References
Adachi, F. (2001). Wireless past and future? Evolving mobile communications systems. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 84(1), 55–60.
Akyildiz, I.F., Gutierrez-Estevez, D.M., & Reyes, E.C. (2010). The evolution to 4G cellular systems: LTE-Advanced. Physical Communication, 3(4), 217–244.
Ghosh, A., & Ratasuk, R. (2011). Essentials of LTE and LTE-A. Cambridge: Cambridge University Press.
Andrews, J.G., Buzzi, S., Choi, W., Hanly, S.V., Lozano, A., Soong, A.C., & Zhang, J.C. (2014). What will 5G be?. IEEE Journal of Selection and Areas Communication, 32(6), 1065–1082.
Wang, C. -X., Haider, F., Gao, X., You, X., et al. (2014). Cellular architecture and key technologies for 5G wireless communication networks. IEEE Communications Magazine, 52(2), 122–130.
Hoydis, J., Ten Brink, S., & Debbah, M. (2013). Massive MIMO in the UL/DL of cellular networks: How many antennas do we need?. IEEE Journal of Selection and Areas Communication, 31(2), 160–171.
Björnson, E., Sanguinetti, L., Hoydis, J., & Debbah, M. (2015). Optimal design of energy-efficient multi-user MIMO systems: Is massive MIMO the answer?. IEEE Transactions in Wireless Communication, 14(6), 3059–3075.
Larsson, E.G., Edfors, O., Tufvesson, F., & Marzetta, T.L. (2014). Massive MIMO for next generation wireless systems. IEEE Communications Magazine, 52(2), 186–195.
Lu, L., Li, G.Y., Swindlehurst, A.L., Ashikhmin, A., & Zhang, R. (2014). An overview of massive MIMO: Benefits and challenges. IEEE Journal of Selection Topics Signal Processing, 8(5), 742–758.
Rusek, F., Persson, D., Lau, B.K., et al. (2013). Scaling up MIMO: Opportunities and challenges with very large arrays. IEEE Signal Processing Magazine, 30(1), 40–60.
Micciancio, D. (2001). The hardness of the closest vector problem with preprocessing. IEEE Transactions on Information Theory, 47(3), 1212–1215.
Verdú, S. (1989). Computational complexity of optimum multiuser detection. Algorithmica, 4(1-4), 303–312.
van Etten, W. (1976). Maximum likelihood receiver for multiple channel transmission systems. IEEE Transactions on Communications, 24(2), 276–283.
Chen, P., & Kobayashi, H. (2002). Maximum likelihood channel estimation and signal detection for OFDM systems. In Proceedings of the IEEE International Conference on Communication (ICC), (Vol. 3 pp. 1640–1645).
Damen, M.O., El Gamal, H., & Caire, G. (2003). On maximum-likelihood detection and the search for the closest lattice point. IEEE Transactions on Information Theory, 49(10), 2389–2402.
Klein, A., Kaleh, G.K., & Baier, P.W. (1996). Zero forcing and minimum mean-square-error equalization for multiuser detection in code-division multiple-access channels. IEEE Transactions on Vehicular Technology, 45 (2), 276–287.
Shnidman, D. (1967). A generalized nyquist criterion and an optimum linear receiver for a pulse modulation system. Bell System Technical Journal, 46(9), 2163–2177.
Meijerink, J.A., & et al. (1977). An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Mathematics of Computation, 31(137), 148–162.
Prabhu, H., Rodrigues, J., Edfors, O., & Rusek, F. (2013). Approximative matrix inverse computations for very-large MIMO and applications to linear pre-coding systems. In Proceedings of the IEEE Wireless Communication and Network Conference (WCNC) (pp. 2710–2715).
Zhu, D., Li, B., & Liang, P. (2015). On the matrix inversion approximation based on neumann series in massive MIMO systems. In Proceedings of the IEEE International Conference on Communication (ICC) (pp. 1763–1769).
Wang, F., Zhang, C., Yang, J., Liang, X., You, X., & Xu, S. (2015). Efficient matrix inversion architecture for linear detection in massive MIMO systems, in Proc. In Proceedings of the IEEE International Conference on Digital Signal Processing (DSP) (pp. 248–252).
Wu, M., Yin, B., Vosoughi, A., Studer, C., Cavallaro, J.R., & Dick, C. (2013). Approximate matrix inversion for high-throughput data detection in the large-scale MIMO uplink. In Proceedings of the IEEE International Symposium on Circuits and System (ISCAS) (pp. 2155–2158).
Yin, B., Wu, M., Wang, G., Dick, C., Cavallaro, J.R., & Studer, C. (2014). A 3.8 Gb/s large-scale MIMO detector for 3GPP LTE-Advanced. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3879–3883).
Liang, X., Zhang, C., Xu, S., & You, X. (2015). Coefficient adjustment matrix inversion approach and architecture for massive MIMO systems (pp. 1–4).
Wu, M., Yin, B., Wang, G., et al. (2014). Large-scale MIMO detection for 3GPP LTE: Algorithms and FPGA implementations. IEEE Journal of Selection Topics Signal Processing, 8(5), 916–929.
Wang, F., Zhang, C., Liang, X., Wu, Z., Xu, S., & You, X. (2015). Efficient iterative soft detection based on polynomial approximation for massive MIMO. In Proceedings of the International Conference on Wireless Communications Signal Processing (WCSP) (pp. 1–5).
Wu, Z., Zhang, C., Zhang, S., & You, X. (2016). Adjustable iterative soft-output detection for massive MIMO uplink (pp. 1–5).
Yin, B., Wu, M., Cavallaro, J.R., et al. (2014). Conjugate gradient-based soft-output detection and precoding in massive MIMO systems. In Proceedings of the IEEE Global Communication Conference (GLOBECOM) (pp. 3696–3701).
Hu, Y., Wang, Z., Gaol, X., & Ning, J. (2014). Low-complexity signal detection using CG method for uplink large-scale MIMO systems. In Proceedings of the IEEE International Conference on Communication Systems (ICCS) (pp. 477–481).
Zhou, J., Hu, J., Chen, J., & He, S. (2015). Biased MMSE soft-output detection based on conjugate gradient in massive MIMO, in Proc. In Proceedings of the IEEE International Conference on ASIC (ASICON) (pp. 1–4).
Yin, B., & et al. (2015). VLSI design of large-scale soft-output MIMO detection using conjugate gradients. In Proceedings of the IEEE International Symposium on Circuits and System (ISCAS) (pp. 1498–1501).
Xue, Y., Zhang, C., Zhang, S., & You, X. (2016). A fast-convergent pre-conditioned conjugate gradient detection for massive MIMO uplink. In Proceedings of the IEEE International Conference on Digital Signal Processing (DSP) (pp. 331–335).
Dai, L., Gao, X., Su, X., Han, S., et al. (2015). Low-complexity soft-output signal detection based on Gauss-Seidel method for uplink multiuser large-scale MIMO systems. IEEE Transactions on Vehicular Technology, 64(10), 4839–4845.
Wu, Z., Zhang, C., Xue, Y., Xu, S., & You, X. (2016). Efficient architecture for soft-output massive MIMO detection with Gauss-Seidel method, in Proc. In Proceedings of the IEEE International Symposium on Circuits and System (ISCAS) (pp. 1886–1889).
Wu, Z., Xue, Y., You, X., & Zhang, C. (2017). Hardware efficient detection for massive MIMO uplink with parallel Gauss-Seidel method. In Proceedings of the International Conference on Digital Signal Processing (DSP) (pp. 1–5).
Gao, X., Dai, L., Hu, Y., et al. (2014). Matrix inversion-less signal detection using SOR method for uplink large-scale MIMO systems (pp. 3291–3295).
Ning, J., Lu, Z., Xie, T., & Quan, J. (2015). Low complexity signal detector based on SSOR method for massive MIMO systems (pp. 1–4).
Guo, R., Li, X., Fu, W., et al. (2015). Low-complexity signal detection based on relaxation iteration method in massive MIMO systems. China Communication, 12(Supplement), 1–8.
Zhang, P., Liu, L., Peng, G., et al. (2016). Large-scale MIMO detection design and FPGA implementations using SOR method. In Proceedings of the IEEE International Conference on Communication Software and Network (ICCSN) (pp. 206–210).
Yu, A., Zhang, C., Zhang, S., & You, X. (2016). Efficient SOR-based detection and architecture for large-scale MIMO uplink. In Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems (APCCAS) (pp. 402–405).
Wu, Z., Zhang, C., Xue, Y., Xu, S., & You, X. (2016). Efficient architecture for soft-output massive MIMO detection with Gauss-Seidel method. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 1886–1889).
Gao, X., Dai, L., Ma, Y., et al. (2014). Low-complexity near-optimal signal detection for uplink large-scale MIMO systems. Electronics Letters, 50(18), 1326–1328.
Qin, X., Yan, Z., & He, G. (2016). A near-optimal detection scheme based on joint steepest descent and jacobi method for uplink massive MIMO systems. IEEE Computing Letters, 20(2), 276–279.
Song, W., Chen, X., Wang, L., et al. (2016). Joint conjugate gradient and Jacobi iteration based low complexity precoding for massive MIMO systems. In Proceedings of the IEEE/CIC International Conference on Communication in China (ICCC) (pp. 1–5).
Xue, Y., Zhang, C., Zhang, S., Wu, Z., & You, X. (2016). Steepest descent method based soft-output detection for massive MIMO uplink. In Proceedings of the IEEE International Workshop on Signal Processing Systems (SiPS) (pp. 273–278).
Wang, X., Jones, P.H., & Zambreno, J. (2016). A configurable architecture for sparse LU decomposition on matrices with arbitrary patterns. ACM SIGARCH Computer Architecture News, 43(4), 76–81.
Jaiswal, M.K., & Chandrachoodan, N. (2012). FPGA-based high-performance and scalable block LU decomposition architecture. IEEE Transactions on Computing, 61(1), 60–72.
Wu, G., Xie, X., Dou, Y., et al. (2012). Parallelizing sparse LU decomposition on FPGAs. In Proceedings of the International Conference on Field-Program Technology (FPT) (pp. 352–359).
Johnson, J., Chagnon, T., et al. (2008). Sparse LU decomposition using FPGA. In Proceedings of the International Workshop on State-of-the-Art in Science and Parallel Computinf (PARA).
Godana, B.E., & Ekman, T. (2013). Parametrization based limited feedback design for correlated MIMO channels using new statistical models. IEEE Transaction Wireless Communication, 12(10), 5172–5184.
Sibille, A., Oestges, C., & Zanella, A. (2010). MIMO: from theory to implementation. Cambridge: Academic Press.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260–269.
Benzi, M. (2002). Preconditioning techniques for large linear systems: A survey. Journal of Computational Physics, 182(2), 418–477.
Saad, Y. (2003). Iterative methods for sparse linear systems. USA: SIAM.
Matthaiou, M., McKay, M.R., Smith, P.J., et al. (2010). On the condition number distribution of complex Wishart matrices. IEEE Transactions on Communications, 58(6), 1705–1717.
Seethaler, D., Matz, G., et al. (2004). An efficient MMSE-based demodulator for MIMO bit-interleaved coded modulation. In Proceedings of the IEEE Global Communication Conference (GLOBECOM), (Vol. 4 pp. 2455–2459).
Bai, Z., Demmel, J., Dongarra, J., et al. (2000). Templates for the solution of algebraic eigenvalue problems: A practical guide. USA: SIAM.
Collings, I.B., Butler, M.R., & McKay, M. (2004). Low complexity receiver design for MIMO bit-interleaved coded modulation (pp. 12–16).
Ordóñez, L.G., Palomar, D.P., & Fonollosa, J.R. (2009). Ordered eigenvalues of a general class of hermitian random matrices with application to the performance analysis of MIMO systems. IEEE Transactions Signal Processing, 57(2), 672–689.
Acknowledgements
This work is partially supported by NSFC under grants 61501116, 61571105, and 61701293, Jiangsu Provincial NSF under grant BK20140636, State Key Laboratory of ASIC & System under grant 2016KF007, Student Research Training Program of SEU, Huawei HIRP Flagship under grant YB201504, Intel Collaborative Research Institute for MNC, the Fundamental Research Funds for the Central Universities, and the Project Sponsored by the SRF for the Returned Overseas Chinese Scholars of State Education Ministry.
Author information
Authors and Affiliations
Corresponding author
Appendices
Appendix A: Proof of Corollary 1
The SCN of a given matrix A, which is defined as κ(A), is the ratio of the largest and smallest eigenvalues. Since the MMSE equalization matrix A can be decomposed as:
where U is the unitary matrix and Λ is the diagonal matrix with eigenvalues as its entries.
To obtain the abstract of the distribution of κ(A), the joint probability density function (PDF) of the ordered eigenvalues of A is required. Fortunately, for each eigenvalue of matrix A, we have
Here, H H H is a complex central Wishart distribution with B degree of freedom and U eigenvalues. The joint PDF of the ordered eigenvalues of H H H for Cases 1, 2, and 3 in Section 2.1 is given by [59] and the corresponding analysis of the SCN distribution is given by [55]. Thus, the joint PDF of the ordered eigenvalues, \(\mathbf {\lambda }\triangleq [\lambda _{1}, \lambda _{2}\ldots \lambda _{U}]\) with λ 1 ≥ λ 2 ≥… ≥ λ U ≥ 0 of A is
According to Eq. 23, we have
This indicates that λ A and λ H H H have the same distribution when only the influence of system loading factor α and channel correlation coefficient ζ are considered. Finally, Corollary 1 can be proved by the result in [55] (See Statement 1 and 2 in Section 3.1).
Appendix B: Derivation of APSD Detection Algorithm
In classical SD method, \(\hat {\mathbf {s}}\) is iteratively calculated [54] by:
Since z (j− 1) is the iterative residual, −z (j− 1) is the iterative research direction and α (j− 1) is the iterative research length in the (j − 1)-th iteration.
In the proposed APSD algorithm, the system is preconditioned from \(\mathbf {A}\hat {\mathbf {s}}=\tilde {\mathbf {y}}\) to \(\mathbf {(LDL^H)}^{-1}\mathbf {A}\hat {\mathbf {s}}=\mathbf {(LDL^H)}^{-1}\tilde {\mathbf {y}}\). The iterative residual in the j-th iteration changes from \(\tilde {\mathbf y}^{(j)}\) in
to
where
Substitute s (j) to Eq. 26, then \(\tilde {\mathbf {y}}^{(j)}\) is calculated by:
To derive α (j− 1), we consider the basic principle of SD method that, the set of search direction z should satisfy
Then α (j− 1) is obtained based on Eqs. 28, 30, and 31,
Summarize Eqs. 26, 28, 29, and 32, the proposed algorithm is obtained
Combining with Algorith1, we can obtain Algorithm 2.
Rights and permissions
About this article
Cite this article
Xue, Y., Wu, Z., Yang, J. et al. Adaptive Preconditioned Iterative Linear Detection and Architecture for Massive MU-MIMO Uplink. J Sign Process Syst 90, 1453–1467 (2018). https://doi.org/10.1007/s11265-017-1317-8
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11265-017-1317-8