1 Introduction

Low-Density Parity-Check (LDPC) codes are a class of advanced error-correcting codes widely used in modern telecommunications. They were first invented by Gallager [1] in 1962 and rediscovered by MacKay and Neal [2] in 1996. To date, LDPC codes have been adopted in many communication systems, e.g., 4G/5G, 10GBase-T Ethernet (IEEE 802.3an), Wi-Fi (IEEE 802.11n & 802.11ac), WiMAX (IEEE 802.16e), and DVB-S2.

The decoding of channel codes is an important issue. Belief Propagation (BP) is the optimal decoding algorithm for LDPC codes, provided that the channel states are exactly known. In the absence of such knowledge, however, the optimal decoding of LDPC codes remains an open problem. In 2012, Fang [3] proposed the Sliding-Window BP (SWBP) algorithm to handle this issue; it can accurately track time-varying channel states. In many experiments, this technique exhibits near-limit performance. Meanwhile, it is easy to implement and insensitive to initial settings. Another outstanding advantage of SWBP is that it lends itself to parallelization. Although SWBP is very attractive in practice, it suffers from intensive computation: the optimal window size, the key factor for estimating the channel state, is found by exhaustively evaluating all possible window sizes. If SWBP is implemented serially, the running time grows quickly with the code length. Therefore, parallelization of SWBP is indispensable.

In recent years, Graphics Processing Units (GPUs) have demonstrated powerful computing ability and evolved into general-purpose computing devices. GPUs can accelerate many computationally intensive problems with thousands of threads executing in parallel. In addition, NVIDIA provides the Compute Unified Device Architecture (CUDA) as a programming interface that lets researchers develop GPU applications in C-like languages.

Since BP-based LDPC decoding can be easily parallelized, many GPU-accelerated LDPC decoders have been presented [4, 5]. However, to the best of our knowledge, a parallel SWBP has not yet been reported. In this paper, we speed up the SWBP algorithm on GPUs. There are two major bottlenecks in SWBP. The first is the belief passing of standard BP between Variable Nodes (VNs) and Check Nodes (CNs). Since this problem has been addressed by many parallel algorithms, we do not discuss it further; the reader is referred to [6,7,8] for details. The second is estimating the best window size, a time-consuming process. We tackle this problem by computing the metrics of different window sizes simultaneously. Coalesced memory access is applied to accelerate reads and writes, and intermediate variables are stored in registers and shared memory to reduce memory latency. We use three different algorithms, i.e., thrust, cuBLAS, and a reduction kernel, to find the minimum element of an array.

The rest of this paper is organized as follows. Section 2 reviews the background knowledge of SWBP and GPUs. Section 3 introduces the parallelized SWBP on GPUs using CUDA. Section 4 reports experimental results. Finally, Sect. 5 concludes this paper.

2 Review on Background

2.1 Preliminaries

Let n be the length of the LDPC code in use, which is specified by an \(m\times n\) parity-check matrix \(\mathbf{H}\). Let \(\mathbf{x} = (x_1,\ldots ,x_n)^\top \) be the codeword sent by the encoder and \(\mathbf{y} = (y_1,\ldots ,y_n)^\top \) be the sequence received by the decoder. Multiplying \(\mathbf{H}\) with \(\mathbf{x}\) gives the syndrome \(\mathbf{c} = (c_1,\ldots ,c_m)^\top = \mathbf{Hx}\). For simplicity, this paper considers the memoryless Binary Symmetric Channel (BSC) model, i.e., \(\Pr (x_i\ne y_i)=p_i\). If the channel is stationary, then \(p_1=\cdots =p_n\).

BP is a popular algorithm for decoding LDPC codes. To introduce it, we first define some notation. Let \(\mathbf{v} = (v_1,\ldots ,v_n)^\top \) denote the overall Log-Likelihood Ratios (LLRs) of the VNs, including both intrinsic and extrinsic LLRs. Let \(q_{ij}\) be the LLR propagated from the ith VN to the jth CN and \(r_{ji}\) be the LLR propagated from the jth CN to the ith VN. Let \(dv_i\) be the degree of the ith VN and \(dc_j\) be the degree of the jth CN. Let \(\tilde{\mathbf{x}}\) be the estimate of \(\mathbf{x}\). We illustrate the standard BP in Algorithm 1.

Algorithm 1 Standard BP decoding
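For reference, the updates in Algorithm 1 take the usual syndrome-based form (a sketch in our notation; see [3] for the exact schedule):

$$\begin{aligned} r_{ji} = (1-2c_j)\cdot 2\tanh ^{-1}\!\Big ( \prod _{i' \in N(j)\setminus i} \tanh \frac{q_{i'j}}{2} \Big ), \qquad q_{ij} = v_i - r_{ji}, \end{aligned}$$

where \(N(j)\) is the set of VNs connected to the jth CN, \(v_i\) is the sum of the intrinsic LLR \((1-2y_i)\ln \frac{1-p_i}{p_i}\) and all incoming \(r_{ji}\)'s, and the decision after each iteration is \(\tilde{x}_i = 1\) if \(v_i < 0\) and \(\tilde{x}_i = 0\) otherwise.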

The LDPC decoder with SWBP includes three phases: standard BP, window-size setting, and local bias probability refinement. A detailed explanation of SWBP can be found in [3]. Since many researchers have already used GPUs to accelerate the BP phase of LDPC decoding [4, 5], we do not discuss the standard BP phase in this paper.

2.2 Bias Probability Estimation

To estimate the bias probability of each BSC sub-channel, we re-compute the local bias probability \(\tilde{p}_i\) by averaging the overall beliefs of the neighboring variable nodes in a size-s window (s odd) around \(y_i\). Let \(b_i=\frac{1}{1+\exp (v_i)}\). Then

$$\begin{aligned} \tilde{p}_i = \frac{-b_i + \sum \nolimits _{i' = \max (1,i-h)}^{\min (i+h,n)} b_{i'}}{\min (i+h,n) - \max (1,i-h)}, \end{aligned}$$
(1)

where \(h = \left\lfloor s/2 \right\rfloor \). Clearly, (1) can be rewritten in the recursive form

$$\begin{aligned} \tilde{p}_i = \left\{ \begin{array}{ll} \displaystyle \tilde{p}_{i-1} + \frac{b_{i-1} - b_i + b_{i+h} - \tilde{p}_{i-1}}{i+h-1}, &{} 2 \le i \le 1+h \\ \displaystyle \tilde{p}_{i-1} + \frac{b_{i-1} - b_i + b_{i+h} - b_{i-h-1}}{2h}, &{} 2+h \le i \le n-h \\ \displaystyle \tilde{p}_{i-1} + \frac{b_{i-1} - b_i - b_{i-h-1} + \tilde{p}_{i-1}}{n-i+h}, &{} n-h+1 \le i \le n. \end{array} \right. \end{aligned}$$
(2)

2.3 Window-Size Setting

Estimating an appropriate window size s is the key step of SWBP. To address this problem, [3] uses the Mean Squared Error (MSE) as the metric. For each candidate s, \(\tilde{{{\varvec{p}}}}\) is first calculated according to (2), and then the MSE between \({{{\varvec{b}}}}\) and \(\tilde{{{\varvec{p}}}}\) is calculated. Finally, the s that gives the smallest MSE is selected. This procedure is summarized in Algorithm 2.

Algorithm 2 Window-size setting
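As a reference point for the parallel version, the following is a minimal sequential C sketch of Algorithm 2 under our assumptions (names are ours; we start from s = 3 since s = 1 makes (1) degenerate):

```cuda
#include <float.h>

/* Hypothetical sketch of Algorithm 2 (0-based indices). For each odd
 * candidate s, slide a running window sum across b[] to get the local
 * averages of (1), accumulate the squared error against b[], and keep
 * the s with the smallest MSE. Cost: O(n) per candidate, O(n^2) overall. */
int best_window_size(const double *b, int n)
{
    double best_mse = DBL_MAX;
    int    best_s   = 3;

    for (int s = 3; s <= n; s += 2) {              /* odd candidate sizes */
        int    h   = s / 2;
        double win = 0.0, mse = 0.0;
        int    lo  = 0, hi = (h < n - 1) ? h : n - 1;

        for (int k = lo; k <= hi; ++k) win += b[k]; /* initial window sum */

        for (int i = 0; i < n; ++i) {
            double p = (win - b[i]) / (hi - lo);    /* Eq. (1), b_i excluded */
            double d = b[i] - p;
            mse += d * d;                           /* division by n omitted */
            if (i + 1 + h <= n - 1) win += b[++hi]; /* window grows right */
            if (i + 1 - h >= 1)     win -= b[lo++]; /* window shrinks left */
        }
        if (mse < best_mse) { best_mse = mse; best_s = s; }
    }
    return best_s;
}
```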

It is reasonable that the best window size s should minimize the MSE between the overall beliefs and the local bias probabilities.

2.4 Complexity Analysis of SWBP

In the window-size setting phase, for each candidate s, about 4n additions/subtractions and n divisions are performed to compute \(\tilde{{{\varvec{p}}}}\). Then n subtractions and n multiplications are performed to compute the terms \(({b_i} - {\tilde{p}_i})^2\). Finally, \(n-1\) additions and 1 division are performed to get \({\sigma }^2=\frac{1}{n}\sum _{i = 1}^n {{{({b_i} - {{\tilde{p}}_i})}^2}}\). Therefore, in total about 8n addition/subtraction/multiplication/division operations are needed for each s. To find the best s, one needs to try every odd s between 1 and n, i.e., \(\left\lfloor n/2 \right\rfloor \) window-size setting iterations, where \(\left\lfloor \cdot \right\rfloor \) denotes the floor function. In [3], the minimum search step over window sizes was set to 20 to reduce the computational complexity. In our experiments, we found this interval so large that the best window size s may be missed. We therefore fix the search step of the window size to 2.

In the local bias probability refinement phase, about 4n additions/subtractions and n divisions are performed to refine \(\tilde{{{\varvec{p}}}}\) in each BP iteration. Since the outputs of successive BP iterations are usually very similar, it is unnecessary to refine \(\tilde{{{\varvec{p}}}}\) after every BP iteration. In [3], the source bias probabilities are re-estimated every 10 BP iterations (unless stated otherwise), which is a good tradeoff between performance and complexity. Compared to the window-size setting phase, the computational complexity of this phase is negligible.

According to the above analysis, the total computational complexity is \(\mathcal{O}(4n^2)\) per SWBP iteration. The bottleneck of SWBP lies in the window-size setting phase (standard BP is not counted in this paper). Clearly, for long LDPC codes (large n), SWBP is computationally heavy.

To verify the aforementioned analysis, we fed a regular LDPC code of length 2000 into the SWBP algorithm and measured the running time of its modules. The results are listed in Table 1, which shows that selecting the best window size is indeed the bottleneck.

Table 1 Running time of different parts of SWBP algorithm

2.5 Features of GPUs

Originally, GPUs were designed to accelerate the creation of images for output to a display device. With their rapid progress, the highly parallel structure of modern GPUs makes them efficient at processing large blocks of data in parallel. For example, with its 3584 CUDA cores, the NVIDIA GTX 1080Ti offers massive computing power: one GPU hosts thousands of threads, all of which can run in parallel. The memory architecture of the GPU is depicted in Fig. 1. The global, local, and texture memories have large capacities but low speed, while shared memory and registers are fast but scarce. To optimize a program, we must therefore allocate memory carefully.

Fig. 1 GPU memory architecture

Meanwhile, NVIDIA introduced CUDA as a parallel computing platform and application programming interface. The CUDA platform is designed to work with C-like programming languages, so software engineers can easily use CUDA to develop parallel programs that exploit GPU resources. On the CUDA platform, we can access compute kernels, which are the parallel computational elements. The CUDA compute capability version specifies the maximum numbers of parallel elements, such as threads and blocks.

2.6 Testing Platform

In our experiments, two different test platforms are used to investigate the speedup. The sequential SWBP runs on the CPU platform and the parallel SWBP runs on the GPU platform. The specifications of the two platforms are listed in Table 2.

Table 2 Specification of CPU platform and GPU platform

3 Parallel SWBP

3.1 Algorithm

In the sequential SWBP algorithm, each window-size setting iteration generates one MSE \(\sigma ^2\), and each \(\sigma ^2_i\) is calculated from \({{{\varvec{b}}}}\) and the \(\tilde{{{\varvec{p}}}}\) induced by window size \(s_i\). Therefore, any two MSEs \(\sigma ^2_i\) and \(\sigma ^2_j\) (\(i\not =j\)) have no data dependency and can be calculated simultaneously. In our parallel SWBP, all the \(\sigma ^2_i\)'s are calculated simultaneously by thousands of GPU threads. The two algorithms are contrasted in Fig. 2.

Fig. 2 a Sequential SWBP algorithm. b Parallel SWBP algorithm
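A minimal kernel sketch of this mapping (hypothetical: we assume thread t evaluates the odd candidate size \(s = 2t+3\) and writes its unnormalized MSE to \(dev\_sigma2[t]\); shared-memory staging of b is deferred to Sect. 3.3):

```cuda
// Hypothetical kernel skeleton: one thread per candidate window size.
// The per-position local average follows Eq. (1) via a running window sum.
__global__ void mse_kernel(const float *b, float *dev_sigma2, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int s = 2 * t + 3;                    // odd candidate window size
    if (s > n) return;
    int h = s / 2;

    float sum = 0.0f;                     // running sum of squared errors
    float win = 0.0f;                     // running window sum
    int lo = 0, hi = min(h, n - 1);
    for (int k = lo; k <= hi; ++k) win += b[k];

    for (int i = 0; i < n; ++i) {
        float p = (win - b[i]) / (hi - lo);   // local bias, Eq. (1)
        float d = b[i] - p;
        sum += d * d;                         // division by n omitted
        if (i + 1 + h <= n - 1) win += b[++hi];
        if (i + 1 - h >= 1)     win -= b[lo++];
    }
    dev_sigma2[t] = sum;                  // unnormalized MSE for size s
}
```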

3.2 Coalesced Memory Access

In our SWBP algorithm, the input is the overall belief vector b, which is stored in the global memory of the GPU. Global memory has a large capacity and easily accommodates large arrays, but it is off-chip and much slower than on-chip memory. To reduce the access time to global memory while reading b, coalesced memory access should be used: instead of performing 16 individual memory accesses, all 16 threads of a half-warp should access global memory in a single read. To this end, the elements of b must lie in a contiguous memory block, with the kth thread accessing the kth data element. The principle of coalesced memory access is illustrated in Fig. 3.

Fig. 3 Coalesced memory access
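A minimal sketch of such a coalesced, cooperative load (hypothetical names; each thread of a block copies a strided slice of b from global into shared memory):

```cuda
// Threads of a block cooperatively stage b[] into shared memory.
// Consecutive threads read consecutive addresses, so each half-warp's
// reads coalesce into a single memory transaction.
extern __shared__ float sb[];
for (int k = threadIdx.x; k < n; k += blockDim.x)
    sb[k] = b[k];
__syncthreads();   // all of b[] is now visible to every thread in the block
```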

3.3 Using Shared Memory

Shared memory is an on-chip memory with low latency and high bandwidth. Its speed is comparable to that of registers and hundreds of times faster than that of global memory. Note, however, that shared memory is small, only 48 KB per block, and must be allocated carefully so as not to exceed this capacity. All threads in one block can access variables in the same shared memory (see Fig. 1). Therefore, we define an array in shared memory and two variables in registers.

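A minimal sketch of these declarations (hypothetical; we assume the kernel is launched with n floats of dynamic shared memory):

```cuda
extern __shared__ float b[];  // overall beliefs, staged from global memory
float bias;                   // local bias probability, lives in a register
float sum;                    // running sum of squared errors, in a register
```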

where b[n] holds the overall beliefs loaded from global memory into shared memory, bias stores the local bias probability \(\tilde{p}_i\) calculated by Eq. (2), and sum accumulates the running sum of \((b_i-\tilde{p}_i)^2\). Since these two variables are frequently used in each thread, keeping them in registers reduces the total access time. \(\sigma ^2_i\) is then obtained by accumulating sum; because \(\sigma ^2_i\) is only used for comparison, the final division by n can be omitted.

3.4 Finding the Best Window Size

In global memory, we define an array \(dev\_sigma2\) of length n to store all the \(\sigma ^2_i\)'s, each calculated by one thread. The best window size s is the one giving the smallest \(\sigma ^2_i\), so the problem reduces to finding the smallest element of \(dev\_sigma2\). To solve it, we investigated three parallel algorithms: (1) thrust; (2) cuBLAS; and (3) a reduction kernel.

3.4.1 Thrust

Thrust is a powerful library of parallel algorithms and data structures [10] that provides a flexible, high-level interface for GPU programming. In the thrust library, the \(min\_element()\) function returns an iterator to the minimum element of an array. We therefore call the thrust library as follows, where the variable index receives the index of the minimum element.

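A minimal sketch of this call (assuming \(dev\_sigma2\) is a raw device pointer to the n MSE values):

```cuda
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

// Wrap the raw device pointer, search, and convert the iterator to an index.
thrust::device_ptr<float> first    = thrust::device_pointer_cast(dev_sigma2);
thrust::device_ptr<float> smallest = thrust::min_element(first, first + n);
int index = smallest - first;   // position of the smallest sigma^2
```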

3.4.2 cuBLAS

The cuBLAS [9] library is a fast GPU-accelerated implementation of the standard Basic Linear Algebra Subroutines (BLAS). In the cuBLAS library, the cublasIsamin() function returns the (1-based) index of the element with the smallest magnitude, which for our non-negative MSEs is exactly the minimum. We therefore call the cuBLAS library as follows, where the variable index receives the index of the smallest element.

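A minimal sketch of this call (handle creation shown for completeness; as noted in Sect. 4, it is excluded from the timing):

```cuda
#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);

int index;                                      // cuBLAS reports a 1-based index
cublasIsamin(handle, n, dev_sigma2, 1, &index);
index -= 1;                                     // convert to 0-based

cublasDestroy(handle);
```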

3.4.3 Reduction Kernel

To achieve the fastest speed, we designed a reduction search algorithm, implemented as a kernel function \(min\_kernel()\) sketched below.

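A minimal sketch of such a kernel (hypothetical; we assume 256 threads per block, each block reducing 256 candidates in shared memory and writing its local minimum and index, which leaves a small second pass):

```cuda
#include <float.h>

// Block-wise min-reduction over sigma2[] that also tracks the winning index.
__global__ void min_kernel(const float *sigma2, float *blk_val, int *blk_idx, int n)
{
    __shared__ float s_val[256];
    __shared__ int   s_idx[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    s_val[tid] = (i < n) ? sigma2[i] : FLT_MAX;   // pad the tail with +inf
    s_idx[tid] = i;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride && s_val[tid + stride] < s_val[tid]) {
            s_val[tid] = s_val[tid + stride];
            s_idx[tid] = s_idx[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {                // one (value, index) pair per block
        blk_val[blockIdx.x] = s_val[0];
        blk_idx[blockIdx.x] = s_idx[0];
    }
}
```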

In our main() function, we call \(min\_kernel()\) and then finish the search over the per-block results to get the index of the smallest element.

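A minimal launch sketch (hypothetical buffers; the handful of per-block minima are finished on the host):

```cuda
int threads = 256;
int blocks  = (n + threads - 1) / threads;
min_kernel<<<blocks, threads>>>(dev_sigma2, dev_blk_val, dev_blk_idx, n);

// Finish the tiny residual search on the CPU.
float h_val[1024]; int h_idx[1024];            // assumes blocks <= 1024
cudaMemcpy(h_val, dev_blk_val, blocks * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_idx, dev_blk_idx, blocks * sizeof(int),   cudaMemcpyDeviceToHost);
int best = 0;
for (int k = 1; k < blocks; ++k)
    if (h_val[k] < h_val[best]) best = k;
int index = h_idx[best];                        // index of the smallest sigma^2
```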

In Sect. 4, codes of different lengths are used in our experiments. We found that our purpose-built reduction kernel performs best.

Table 3 Regular LDPC code parameters (N is codeword length, K is information bit number)
Table 4 Parallel SWBP running time with different algorithms
Fig. 4 Parallel SWBP running time with different algorithms

4 Experimental Results

Our experimental platforms are listed in Table 2. We perform three experiments to investigate the speedup of our parallel SWBP algorithm. The first and second experiments use regular LDPC codes as input, whose parameters are listed in Table 3.

In the first experiment, we compare the three parallel versions of SWBP built on the algorithms of Sect. 3.4. To eliminate randomness, we repeat each test 200 times and report the average running time. The results are listed in Table 4 and plotted in Fig. 4. They show that our purpose-built reduction kernel is the fastest, thrust is the slowest, and cuBLAS lies in between. Although cuBLAS is almost as fast as the reduction kernel, note that we only count the running time of cublasIsamin(), excluding cublasCreate() and cublasDestroy(); these two functions cost about 100 ms on this platform.

Table 5 Running time and speedup ratio
Fig. 5 Running time and speedup ratio (regular LDPC code)

In the second experiment, we compare the sequential and parallel SWBP algorithms. Since the reduction kernel is the fastest, we select the parallel SWBP with the reduction kernel as the representative. The results are listed in Table 5 and plotted in Fig. 5. They show that our parallel SWBP achieves speedups of 14\(\times \) to 118\(\times \) over the tested LDPC codeword lengths, and that the speedup grows rapidly with the codeword length. From the trend in Fig. 5, we expect even higher speedups for longer LDPC codes.

In the last experiment, we feed irregular LDPC codes into our parallel SWBP algorithm; their parameters are listed in Table 6. The results are listed in Table 7 and plotted in Fig. 6. They show that the parallel SWBP achieves speedups of 8\(\times \) to 120\(\times \), with the same trend as for the regular LDPC codes.

Table 6 Irregular LDPC code parameters (N is codeword length, K is information bit number)
Table 7 Running time and speedup ratio (irregular LDPC code)
Fig. 6 Running time and speedup ratio (irregular LDPC code)

5 Conclusion

We proposed a parallel SWBP algorithm for decoding LDPC codes, implemented on the CUDA platform and accelerated by an NVIDIA GTX 1080Ti GPU. Unlike sequential SWBP, parallel SWBP estimates the metrics of different window sizes simultaneously with thousands of GPU threads. By carefully exploiting the GPU memory architecture, the reading and writing time was also reduced. We designed a reduction kernel that finds the smallest element of a long array in parallel and outperforms the thrust and cuBLAS alternatives.

To investigate the speedup, we ran sequential SWBP on a CPU platform and parallel SWBP on a GPU platform. The experimental results show that parallel SWBP achieves speedups of about 14\(\times \) to 118\(\times \) on regular LDPC codes and about 8\(\times \) to 120\(\times \) on irregular LDPC codes. From the observed trend, we expect even higher speedups for longer LDPC codes.

All source code for this paper can be found in [11]; readers may download it for academic purposes.