
1 Introduction

As a class of very good error-correcting codes [1], Low-Density Parity-Check (LDPC) codes are widely used in Fifth Generation (5G) telecommunication and Internet of Things as a Service (IoTaaS). LDPC codes were first invented by Gallager [2] in 1962, but were unfortunately ignored by the information-theory community for more than 30 years. Their renaissance was triggered by MacKay and Neal in 1996.

Originally, binary LDPC codes were decoded by the belief propagation (BP) algorithm (also known as the "sum-product" algorithm) [1]. In 1998, MacKay \(et\ al.\) [3] generalized binary LDPC codes to finite fields GF(\(Q=2^q\)) and proposed Q-ary LDPC codes, which are decoded by the Q-ary BP algorithm. To improve the performance of the BP algorithm, Fang presented the Sliding-Window Belief Propagation (SWBP) algorithm. Extensive experiments [6] show that SWBP achieves better performance with fewer iterations. In addition, it is very easy to implement and is insensitive to the initial settings. Incorporating fast Q-ary BP with SWBP is a natural way to attain better performance and robustness, but it still suffers from heavy computational complexity.

Popularized by NVIDIA, the Graphics Processing Unit (GPU) [7] has demonstrated powerful general-purpose computing ability. NVIDIA also provides a C-like language interface named Compute Unified Device Architecture (CUDA), which is a useful tool for researchers to develop parallel algorithms. Inspired by the GPU's amazing ability, we propose a parallel Q-ary SWBP algorithm and accelerate it on a GTX 1080Ti. Although we have used the GPU to accelerate the parallel binary SWBP algorithm developed in CUDA C++ [8], to the best of our knowledge, no parallel Q-ary SWBP algorithm has been presented.

The rest of this paper is organized as follows: Sect. 2 describes the sequential Q-ary SWBP algorithm. Section 3 analyzes the bottlenecks of the sequential algorithm and presents the parallel algorithm. Section 4 uses the GPU to accelerate the parallel algorithm and gives the experimental results. We conclude this paper in Sect. 5.

2 Q-ary SWBP Algorithm

2.1 Correlation Model

Let \(\mathcal{A} = [0:Q)\) denote the alphabet. Let \(x,y \in \mathcal{A}\) denote the realizations of X and Y, which are two random variables. Let \({X^n}\) be the source to be compressed at the encoder, and let \({Y^n}\) be the Side Information (SI) that resides only at the decoder, where \({X^n} = {Y^n} + {Z^n}\). We model the correlation between \({X^n}\) and \({Y^n}\) as a virtual channel with three properties:

  (1) Additive: \({Z^n}\) and \({Y^n}\) are independent of each other;

  (2) Memoryless: \({p_{{Z^n}}}({z^n}) = \prod \nolimits _{i = 1}^n {{p_{{Z_i}}}({z_i})} \), where \({p_X}(x)\) denotes the Probability Mass Function (pmf) of a discrete random variable X;

  (3) Nonstationary: the pmfs of the \(Z_i\)'s may differ from each other, where \(i \in [1:n]\).

We use the Truncated Discrete Laplace (TDL) distribution to model \(Z_i\):

$$\begin{aligned} {p_{{X_i}\mathrm{{|}}{Y_i}}}\left( {x|y} \right) \propto \frac{1}{{2{b_i}}}\exp \left( { - \frac{{|x - y|}}{{{b_i}}}} \right) \end{aligned}$$
(1)

where \(b_i\) is the local scale parameter. Since \(\sum \nolimits _{x = 0}^{Q - 1} {{p_{{X_i}\mathrm{{|}}{Y_i}}}\left( {x|y} \right) } = 1\), we can obtain

$$\begin{aligned} {p_{{X_i}\mathrm{{|}}{Y_i}}}\left( {x|y} \right) = \frac{{\exp \left( { - \frac{{|x - y|}}{{{b_i}}}} \right) }}{{{\mathcal{L}_Q}({b_i},y)}} \end{aligned}$$
(2)

where \({\mathcal{L}_Q}(b,y) \buildrel \varDelta \over = \sum \nolimits _{x = 0}^{Q - 1} {\exp \left( { - \frac{{|x - y|}}{b}} \right) } \). To reduce the computing complexity, we use an integration to approximate the summation. When b and Q are reasonably big, this approximation is precise enough:

$$\begin{aligned} {\mathcal{L}_Q}(b,y) &\approx \int _0^{Q - 1} {\exp \left( { - \frac{{|x - y|}}{b}} \right) } dx \\ &= 2b\left( {1 - \frac{1}{2}\exp \left( {\frac{{y - (Q - 1)}}{b}} \right) - \frac{1}{2}\exp \left( { - \frac{y}{b}} \right) } \right) \end{aligned}$$
(3)
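
For concreteness, the closed form in (3) maps directly to a one-line helper. Below is a minimal CUDA sketch; the function name Lq_approx and the use of double precision are our own choices, not part of the original implementation.

```cuda
#include <math.h>

// Closed-form approximation (3) of the normalizer L_Q(b, y):
// 2b * (1 - 0.5*exp((y - (Q-1))/b) - 0.5*exp(-y/b)).
// __host__ __device__ lets both CPU and GPU code call it.
__host__ __device__ double Lq_approx(double b, double y, int Q) {
  return 2.0 * b * (1.0 - 0.5 * exp((y - (double)(Q - 1)) / b)
                        - 0.5 * exp(-y / b));
}
```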

The encoder uses a Q-ary LDPC code to compress the source \(\mathrm{{x}} \in {[0:Q)^n}\) into the syndrome \(\mathrm{{s}} \in {[0:Q)^m}\), where \(m < n\). The decoder seeds the source nodes x according to the SI y, and runs the Q-ary BP algorithm to recover x. For the belief propagation between source nodes and syndrome nodes, we give the following definitions: \({\xi _i}(x)\) is the intrinsic pmf of source node \(x_i\); \({\zeta _i}(x)\) is the overall pmf of source node \(x_i\); \(r_{i,j}(x)\) is the pmf passed from source node \(x_i\) to syndrome node \(s_j\); and \(q_{j,i}(x)\) is the pmf passed from syndrome node \(s_j\) to source node \(x_i\), where \(j \in {\mathcal{M}_i}\) and \(i \in {\mathcal{N}_j}\). The encoding and decoding processes of Q-ary BP are stated in [6] and omitted in this paper.
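
As an illustration of the seeding step, each intrinsic pmf can be filled from (1)-(3) with one thread per source node. This is only a sketch: the kernel name seed_intrinsic, the row-major layout xi[i*Q + x], and the reuse of the hypothetical Lq_approx helper above are all our assumptions.

```cuda
// Hypothetical kernel: thread i seeds the intrinsic pmf xi_i(x) of source
// node x_i from SI symbol y_i and local scale parameter b_i, following (2).
__global__ void seed_intrinsic(double* xi, const double* y,
                               const double* b, int n, int Q) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  double L = Lq_approx(b[i], y[i], Q);  // normalizer from (3)
  for (int x = 0; x < Q; x++)
    xi[i * Q + x] = exp(-fabs((double)x - y[i]) / b[i]) / L;
}
```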

2.2 SWBP Algorithm

In Q-ary BP, the source nodes need to be seeded with the local scale parameters b of the virtual correlation channel. In [4] and [5], the parameters of the virtual correlation channel are estimated by the SWBP algorithm. In this paper, we will use the expected \(L_1\) distance between each source symbol and its corresponding SI symbol, defined as

$$\begin{aligned} {\mu _i} \buildrel \varDelta \over = \sum \limits _{x = 0}^{Q - 1} {({\zeta _i}(x) \cdot |x - {y_i}|)} \end{aligned}$$
(4)
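
For concreteness, (4) is a per-symbol dot product that parallelizes trivially over i. Below is a minimal CUDA sketch, assuming the overall pmfs \(\zeta _i\) are stored row-major in zeta[i*Q + x]; the kernel name and layout are ours.

```cuda
// Hypothetical kernel: thread i computes the expected L1 distance mu_i in (4)
// between source symbol x_i (with overall pmf zeta_i) and SI symbol y_i.
__global__ void expected_l1(double* mu, const double* zeta,
                            const double* y, int n, int Q) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  double s = 0.0;
  for (int x = 0; x < Q; x++)
    s += zeta[i * Q + x] * fabs((double)x - y[i]);
  mu[i] = s;
}
```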

Then, the estimated local scale parameter \(\hat{b}_i\) is calculated by averaging the expected \(L_1\) distances of the neighbors of symbol i within a window of size \((2\eta + 1)\):

$$\begin{aligned} {\hat{b}_i}(\eta ) = \frac{{{t_i}(\eta ) - {\mu _i}}}{{\min (i + \eta ,n) - \max (1,i - \eta )}} \end{aligned}$$
(5)

where

$$\begin{aligned} {t_i}(\eta ) \buildrel \varDelta \over = \sum \limits _{i' = \max (1,i - \eta )}^{\min (i + \eta ,n)} {{\mu _{i'}}} \end{aligned}$$
(6)

To calculate (6) efficiently, we first calculate \({t_1}(\eta )\mathrm{{ = }}\sum \limits _{i' = 1}^{1 + \eta } {{\mu _{i'}}} \). Then, for \(i \in [2:n]\),

$$\begin{aligned} {t_i}(\eta )\mathrm{{ = }}\left\{ \begin{array}{ll} {t_{i - 1}}(\eta ) + {\mu _{i + \eta }}, &{} i \in [2:(\eta + 1)] \\ {t_{i - 1}}(\eta ) + {\mu _{i + \eta }} - {\mu _{i - 1 - \eta }}, &{} i \in [(\eta + 2):(n - \eta )] \\ {t_{i - 1}}(\eta ) - {\mu _{i - 1 - \eta }}, &{} i \in [(n - \eta + 1):n] \\ \end{array} \right. \end{aligned}$$
(7)
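
To make the recursion concrete, the following host-side sketch evaluates (5)-(7) for one candidate \(\eta \) in O(n) time. The 1-based indices of (5)-(7) are mapped to 0-based arrays, and the function name window_means is our own.

```cuda
// Host-side sketch of (5)-(7): the sliding sums t_i are updated recursively,
// then b_hat_i = (t_i - mu_i) / (window size minus one), excluding mu_i itself.
void window_means(const double* mu, double* b_hat, int n, int eta) {
  double t = 0.0;
  for (int k = 0; k <= eta && k < n; k++) t += mu[k];  // t_1(eta) in (6)
  for (int i = 0; i < n; i++) {
    if (i > 0) {
      if (i + eta < n)      t += mu[i + eta];      // right edge enters window
      if (i - eta - 1 >= 0) t -= mu[i - eta - 1];  // left edge leaves window
    }
    int lo = (i - eta > 0) ? (i - eta) : 0;
    int hi = (i + eta < n - 1) ? (i + eta) : (n - 1);
    b_hat[i] = (t - mu[i]) / (double)(hi - lo);    // (5)
  }
}
```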

As in [4] and [5], the main purpose of SWBP is to find the best half window size \(\hat{\eta }\). We define an expected rate:

$$\begin{aligned} \gamma (\eta ) \buildrel \varDelta \over = \frac{1}{n}\sum \limits _{i = 1}^n {\left( {\frac{{{\mu _i}}}{{{\hat{b}_i}(\eta )}}{{\log }_2}e + {{\log }_2}{\mathcal{L}_Q}({\hat{b}_i}(\eta ),{y_i})} \right) } \end{aligned}$$
(8)

where \({\mathcal{L}_Q}({\hat{b}_i}(\eta ),{y_i})\) is defined by (3). The best half window size \(\hat{\eta }\) is chosen by

$$\begin{aligned} \hat{\eta }\mathrm{{ = }}\mathop {\arg \min }\limits _\eta \gamma (\eta ), \end{aligned}$$
(9)

It is a natural idea that the best half window size should minimize the expected rate. The flowchart of the Q-ary SWBP algorithm is illustrated in Fig. 1.

Fig. 1. Flowchart of Q-ary SWBP

Fig. 2. (a) Sequential SWBP algorithm, (b) parallel SWBP algorithm

3 Parallel SWBP Algorithm

In the sequential SWBP algorithm, each window size setup iteration generates an expected rate \(\gamma (\eta )\), which is calculated by (8). Any two expected rates \(\gamma ({\eta _1})\) and \(\gamma ({\eta _2})\) (\({\eta _1} \ne {\eta _2}\)) are uncorrelated, and can be computed in parallel. In our parallel algorithm, all \(\gamma (\eta )\), \(\eta \in \left\{ {1,2, \ldots ,\left\lfloor {\frac{{n - 1}}{2}} \right\rfloor } \right\} \), are calculated simultaneously by thousands of threads on the GPU, as sketched below. Once all \(\gamma (\eta )\) are obtained, we use the min() function in MATLAB to get the smallest \(\gamma \) and the corresponding best \(\eta \) from the array \(\gamma (\eta )\). The sequential and parallel algorithms are illustrated in Fig. 2.
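
A condensed CUDA sketch of this thread mapping follows: each thread owns one candidate \(\eta \), runs the O(n) recursion of (7) privately, and writes its \(\gamma (\eta )\) from (8) into a shared output array. The kernel name gamma_kernel and the reuse of the hypothetical Lq_approx helper are our assumptions, not the paper's released code.

```cuda
// Hypothetical kernel: thread (eta-1) computes gamma(eta) for one candidate
// half window size, so all gamma values are produced simultaneously.
__global__ void gamma_kernel(double* gamma, const double* mu,
                             const double* y, int n, int Q) {
  int eta = blockIdx.x * blockDim.x + threadIdx.x + 1;
  if (eta > (n - 1) / 2) return;
  const double LOG2E = 1.4426950408889634;     // log2(e)
  double t = 0.0;
  for (int k = 0; k <= eta; k++) t += mu[k];   // t_1(eta) in (6)
  double g = 0.0;
  for (int i = 0; i < n; i++) {
    if (i > 0) {                               // recursion (7), 0-based
      if (i + eta < n)      t += mu[i + eta];
      if (i - eta - 1 >= 0) t -= mu[i - eta - 1];
    }
    int lo = (i - eta > 0) ? (i - eta) : 0;
    int hi = (i + eta < n - 1) ? (i + eta) : (n - 1);
    double b = (t - mu[i]) / (double)(hi - lo);           // (5)
    g += mu[i] / b * LOG2E + log2(Lq_approx(b, y[i], Q)); // term of (8)
  }
  gamma[eta - 1] = g / n;
}
```

Each thread keeps its running sum in registers, so no synchronization is needed; the final argmin of (9) is then taken on the host, e.g. with MATLAB's min().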

4 Experiment Results

Fig. 3. Running time and speedup ratio under Q=256

In our experiments, we use an Intel Core i7 CPU running at 3.60 GHz and an NVIDIA GTX 1080Ti GPU. We use MATLAB 2014b as our development platform. We perform numerical experiments to evaluate the performance of our parallel algorithm.

We set Q=256 and use 4 different regular LDPC codes as our input. The parameters of these LDPC codes are listed in Table 1. To eliminate random errors, we perform 100 tests and average the outputs as our final results. The experimental results are illustrated in Fig. 3, which shows that the parallel Q-ary SWBP algorithm achieves a 2.9\(\times \) to 30.3\(\times \) speedup over the sequential Q-ary SWBP algorithm. The longer the codeword length, the higher the speedup ratio.

Table 1. Different LDPC code parameters (N is codeword length, K is information bit number)

5 Conclusion

We propose a parallel Q-ary SWBP algorithm to decode regular LDPC codes with different code lengths and Q values. This algorithm addresses the bottlenecks of the sequential Q-ary SWBP algorithm and can be accelerated by a GPU. Experimental results show that the parallel algorithm achieves a 2.9\(\times \) to 30.3\(\times \) speedup ratio under Q=256. Longer codeword lengths lead to higher speedup ratios.