1 Introduction

Distributed Video Coding (DVC) [7] is an advanced asymmetric coding scheme that encodes individual frames independently while decoding them conditionally. This feature makes DVC well suited to mobile device environments. As capacity-approaching channel codes, Low-Density Parity-Check (LDPC) codes [6, 9] have been used to compress video under the DVC framework [15]. LDPC codes were first invented by Gallager [6] in 1962 and rediscovered by MacKay and Neal [9] in 1996. They have since become one of the hottest topics in both the research and industrial communities.

In 1998, MacKay et al. [3] generalized binary LDPC codes to finite fields \( \mathrm{GF}(Q = 2^q) \) and proposed Q-ary LDPC (QLDPC) codes. QLDPC codes are a better fit for practical multimedia problems, e.g., video and image compression, since one pixel is normally represented by at least 8 bits. Most LDPC decoders are implemented with the Belief Propagation (BP) algorithm [8] (also known as the “sum-product” algorithm). The Q-ary BP (QBP) algorithm is used to decode QLDPC codes. QBP has a computational complexity of \( O(NtQ^2) \), where N is the codeword length and t is the mean column weight of the sparse parity-check matrix H. To reduce this burden, Declercq et al. [1] proposed a fast Q-ary BP algorithm whose idea is to replace the convolutional operations with the Fast Fourier Transform (FFT), reducing the complexity to \( O(NtQ\log_2 Q) \).

To improve the performance of the BP algorithm, Fang [4] presented the Sliding-Window Belief Propagation (SWBP) algorithm, whose idea is to adaptively select optimal local bias probabilities to seed the variable nodes of BP. Many experiments [4, 5] showed that SWBP achieves better performance with fewer iterations. In addition, it is very easy to implement and insensitive to the initial settings. Recently, Q-ary SWBP (QSWBP) [14] was proposed to handle QLDPC codes and achieved better performance and robustness, although it still suffers from a heavy computational burden.

The Graphics Processing Unit (GPU) [10] invented by NVIDIA has, by means of its highly parallel structure, demonstrated powerful capabilities for high-performance computing. Inspired by this, we propose a parallel version of QSWBP and accelerate it on a GPU. In 2016, the joint-bitplane BP algorithm was accelerated by GPU [2]. In 2019, a parallel binary SWBP algorithm was accelerated by GPU and obtained a remarkable speedup ratio [12]. To the best of our knowledge, no parallel QSWBP algorithm has been presented so far. Instead of C/C++, we use MATLAB as our programming platform in this paper. As a high-level language for scientific computing and rapid prototyping of engineering problems, MATLAB has many advantages, e.g., it eliminates pointers to avoid memory access errors, offers powerful manipulation of vectors and matrices, and provides concise and efficient vectorization instructions. Since 2010, MATLAB has provided support for GPUs [11]. We use our parallel QSWBP algorithm to decode a small fraction of a video under the DVC framework. Numerical experiments are performed to investigate the speedup of the parallel algorithm over the sequential one. A brief version of this paper was published at the IoTaaS 2019 conference [13]; here we present detailed discussions and the application of this algorithm.
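For readers unfamiliar with MATLAB's GPU interface, the following minimal sketch (our illustration, not the paper's code) shows the gpuArray/gather idiom that underlies all GPU acceleration in this paper:

    % Minimal gpuArray example (Parallel Computing Toolbox). Data moved to
    % the device with gpuArray is processed by GPU-enabled built-ins;
    % gather copies the result back to host memory.
    a = gpuArray(rand(4096));   % transfer a 4096-by-4096 matrix to the GPU
    b = fft(a);                 % fft is one of many gpuArray-enabled built-ins
    c = gather(real(b));        % bring the result back to the CPU workspace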

2 QSWBP algorithm

2.1 Correlation model

Let A = [0 : Q) denote the alphabet. Let x, y ∈ A denote the realizations of X and Y, two random variables. Let \( X^n \) be the source to be compressed at the encoder, and let \( Y^n \) be the Side Information (SI) that resides only at the decoder, with \( X^n = Y^n + Z^n \). We model the correlation between input \( X^n \) and output \( Y^n \) as a virtual channel with the following properties: \( Y^n \) and \( Z^n \) are independent of each other; \( p_{Z^n}(z^n) = \prod_{i=1}^{n} p_{Z_i}(z_i) \), where \( p_X(x) \) denotes the Probability Mass Function (pmf) of the discrete random variable X; and the pmfs of the \( Z_i \)'s may differ, where i ∈ [1 : n].

We use the Truncated Discrete Laplace (TDL) distribution to model \( Z_i \):

$$ {p}_{X_i\mid {Y}_i}\left(x|y\right)\propto \frac{1}{2{b}_i}\exp\ \left(-\frac{\mid x-y\mid }{b_i}\right) $$
(1)

where bi is the local scale parameter. Since \( {\sum}_{x=0}^{Q-1}{p}_{X_i\mid {Y}_i}\left(x|y\right)=1 \), we can obtain

$$ p_{X_i \mid Y_i}(x \mid y) = \frac{\exp\left(-\frac{|x-y|}{b_i}\right)}{L_Q(b_i, y)} $$
(2)

where \( L_Q(b,y) = \sum_{x=0}^{Q-1} \exp\left(-\frac{|x-y|}{b}\right) \). To reduce the computational complexity, we approximate the summation by an integral. When b and Q are reasonably large, this approximation is sufficiently precise:

$$ L_Q(b,y) \approx \int_0^{Q-1} \exp\left(-\frac{|x-y|}{b}\right) dx = 2b\left(1 - \frac{1}{2}\exp\left(\frac{y-(Q-1)}{b}\right) - \frac{1}{2}\exp\left(-\frac{y}{b}\right)\right) $$
(3)
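As a concrete illustration, the pmf (2) with the normalizer approximated by (3) can be computed as follows (a minimal MATLAB sketch; the function names tdl_pmf and LQ are ours, not from the paper):

    function p = tdl_pmf(y, b, Q)
        % TDL pmf of (2) for one symbol: p(x) for x = 0..Q-1, given the SI
        % symbol y and local scale parameter b; normalizer per (3).
        x = (0:Q-1)';                      % alphabet column vector
        p = exp(-abs(x - y) ./ b) ./ LQ(b, y, Q);
    end

    function L = LQ(b, y, Q)
        % Closed-form approximation (3) of the normalizing sum in (2).
        L = 2*b .* (1 - 0.5*exp((y - (Q-1))./b) - 0.5*exp(-y./b));
    end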

2.2 Encoding

The encoder uses a QLDPC code to compress the source \( x \in [0:Q)^n \) into the syndrome \( s \in [0:Q)^m \). This is done by a matrix-vector multiplication over the finite field GF(Q):

$$ \mathrm{s}=\mathbf{H}\mathrm{x} $$
(4)

where \( \mathbf{H} \in [0:Q)^{m \times n} \) is the sparse parity-check matrix. In H, the i-th column corresponds to source node \( x_i \), and the j-th row corresponds to syndrome node \( s_j \). If the element \( h_{j,i} \) of H is nonzero, an edge connects \( s_j \) and \( x_i \) in the bipartite graph of H, as illustrated in Fig. 1. We define the indices of all source nodes connected to syndrome node \( s_j \) as \( N_j \triangleq \{i : h_{j,i} \ne 0\} \subset [1:n] \), and the indices of all syndrome nodes connected to source node \( x_i \) as \( M_i \triangleq \{j : h_{j,i} \ne 0\} \subset [1:m] \).

Fig. 1.

(2,4)-regular QLDPC code of length 10. (a) The bipartite graph, where circles represent source nodes and squares represent syndrome nodes; (b) parity check matrix.
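As a reference for (4), the syndrome computation can be sketched with the gf arithmetic type from MATLAB's Communications Toolbox (our illustration, not the paper's encoder; Hint and xint are assumed integer arrays with entries in [0 : Q)):

    % Syndrome computation (4) over GF(2^q) using the gf type. Note: gf
    % arrays are dense, so this is only a functional reference; a practical
    % encoder would exploit the sparsity of H.
    q = 8;                              % alphabet GF(2^q), i.e., Q = 256
    s = gf(Hint, q) * gf(xint(:), q);   % s = Hx over GF(Q)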

2.3 Decoding

The decoder seeds the source nodes x according to the SI y and runs the QBP algorithm to recover x. For the belief propagation between source nodes and syndrome nodes, we give the following definitions: \( \xi_i(x) \) is the intrinsic pmf of source node \( x_i \); \( \zeta_i(x) \) is the overall pmf of source node \( x_i \); \( r_{i,j}(x) \) is the pmf passed from source node \( x_i \) to syndrome node \( s_j \); and \( q_{j,i}(x) \) is the pmf passed from syndrome node \( s_j \) to source node \( x_i \), where j ∈ \( M_i \) and i ∈ \( N_j \).

BP includes 5 steps:

  1.

    Initializing BP:

$$ {\xi}_i(x)={\zeta}_i(x)={p}_{X_i\mid {Y}_i}\left(x|y\right) $$
(5)
$$ {q}_{j,i}(x)=1/Q $$
(6)

where \( p_{X_i \mid Y_i}(x|y) \) is calculated by (2) and (3).

  2.

    Source-to-Syndrome BP:

$$ {r}_{i,j}(x)\propto \frac{\zeta_i(x)}{q_{j,i}(x)} $$
(7)
  3.

    Syndrome-to-Source BP:

$$ {q}_{j,i}\left({h}_{j,i}x\right)={F}^{-1}\left\{\frac{\psi_j(w)}{F\left\{{r}_{i,j}\left({h}_{j,i}x\right)\right\}}\right\} $$
(8)

where \( \psi_j(w) = \prod_{i \in N_j} F\{r_{i,j}(h_{j,i}x)\} \) for j ∈ [1 : m] and w ∈ [0 : Q), F{·} denotes the Fourier Transform, and F−1{·} denotes the inverse Fourier Transform.

  4.

    Overall pmf of Source Nodes:

$$ \zeta_i(x) = \xi_i(x) \prod_{j \in M_i} q_{j,i}(x) $$
(9)
  5.

    Hard Decision and Convergence Test:

$$ \hat{x}_i = \underset{x \in [0:Q)}{\arg\max}\; \zeta_i(x) $$
(10)

If \( \mathrm{s}=\mathbf{H}\hat{\mathrm{x}} \), the decoding process finishes successfully; otherwise, more iterations are performed until either \( \mathrm{s}=\mathbf{H}\hat{\mathrm{x}} \) holds or the iteration count exceeds a prespecified threshold.
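To make step 3 concrete, the sketch below implements the update (8) for a single syndrome node over GF(2^q), where the transform F reduces to the Walsh-Hadamard transform (fwht/ifwht, Signal Processing Toolbox). This is our own illustration under simplifying assumptions: R collects the already-permuted incoming messages \( r_{i,j}(h_{j,i}x) \) as columns, and the index bookkeeping for the syndrome value \( s_j \) is omitted; bsxfun is used for compatibility with the paper's MATLAB 2014b platform.

    function Qmsg = check_node_update(R)
        % R is Q-by-d: column k holds the permuted message from the k-th
        % neighboring source node. 'hadamard' ordering matches GF(2^q).
        F    = fwht(R, [], 'hadamard');      % transform every column
        psi  = prod(F, 2);                   % psi_j(w) of (8)
        % Leave-one-out division plus inverse transform (a production
        % decoder must guard this division against zeros in F).
        Qmsg = ifwht(bsxfun(@rdivide, psi, F), [], 'hadamard');
        Qmsg = max(Qmsg, eps);               % guard against round-off
        Qmsg = bsxfun(@rdivide, Qmsg, sum(Qmsg, 1));  % renormalize to pmfs
    end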

2.4 SWBP algorithm

In QBP, the source nodes must be seeded with the local scale parameter b of the virtual correlation channel. In [1, 8], this parameter is estimated by the SWBP algorithm. In this paper, we use the expected L1 distance between each source symbol and its corresponding SI symbol, defined as

$$ \mu_i \triangleq \sum_{x=0}^{Q-1} \zeta_i(x) \cdot |x - y_i| $$
(11)

Then the estimated local scale parameter \( \hat{b}_i \) is calculated by averaging the expected L1 distances of the neighbors of symbol i within a size-(2η + 1) window:

$$ \hat{b}_i(\eta) = \frac{t_i(\eta) - \mu_i}{\min(i+\eta, n) - \max(1, i-\eta)} $$
(12)

where

$$ {t}_i\left(\eta \right)\overset{\Delta}{=}\sum \limits_{i^{\prime }=\max\ \left(1,i-\eta \right)}^{\min\ \left(i+\eta, n\right)}{\mu}_{i^{\prime }} $$
(13)

To calculate (13), we first compute \( t_1(\eta) = \sum_{i'=1}^{1+\eta} \mu_{i'} \). Then for i ∈ [2 : n],

$$ t_i(\eta) = \begin{cases} t_{i-1}(\eta) + \mu_{i+\eta}, & i \in [2 : (\eta+1)] \\ t_{i-1}(\eta) + \mu_{i+\eta} - \mu_{i-1-\eta}, & i \in [(\eta+2) : (n-\eta)] \\ t_{i-1}(\eta) - \mu_{i-1-\eta}, & i \in [(n-\eta+1) : n] \end{cases} $$
(14)
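In MATLAB, the recursion (14) amounts to a simple running-sum loop (our sketch; mu is the length-n vector of expected L1 distances from (11), and the function name is ours):

    function t = window_sums(mu, eta)
        % t(i) = sum of mu over [max(1,i-eta), min(i+eta,n)], per (13)-(14)
        n    = numel(mu);
        t    = zeros(n, 1);
        t(1) = sum(mu(1:1+eta));            % t_1(eta)
        for i = 2:n
            t(i) = t(i-1);
            if i + eta <= n                 % right edge still inside [1,n]
                t(i) = t(i) + mu(i+eta);
            end
            if i - 1 - eta >= 1             % left edge moves: drop a term
                t(i) = t(i) - mu(i-1-eta);
            end
        end
    end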

As in [1, 8], the main purpose of QSWBP is to find the best half window size \( \hat{\eta} \). We define an expected rate:

$$ \gamma(\eta) \triangleq -\sum_{i=1}^{n}\sum_{x=0}^{Q-1} \zeta_i(x) \cdot \ln \frac{\exp\left(-|x-y_i|/\hat{b}_i(\eta)\right)}{L_Q\left(\hat{b}_i(\eta), y_i\right)} = \sum_{i=1}^{n}\left(\ln L_Q\left(\hat{b}_i(\eta), y_i\right) + \frac{\mu_i}{\hat{b}_i(\eta)}\right) $$
(15)

where \( {L}_Q\left({\hat{b}}_i\left(\eta \right),{y}_i\right) \) is defined by (3). The best half window size \( \hat{\eta} \) is chosen by

$$ \hat{\eta} = \underset{\eta}{\arg\min}\; \gamma(\eta). $$
(16)

The natural idea is that the best half window size should minimize the expected rate. The flowchart of the QSWBP algorithm is illustrated in Fig. 2.

Fig. 2

Flowchart of QSWBP

3 Parallel QSWBP algorithm

3.1 Complexity analysis

As shown in Fig. 2, the decoding process includes four parts.

  • The Source-to-Syndrome BP step needs n iterations to calculate \( r_{i,j}(x) \), since there are n source nodes.

  • The Syndrome-to-Source BP step needs m iterations to calculate \( q_{j,i}(x) \), since there are m syndrome nodes. Let \( rw_j \) denote the j-th row weight; in each iteration, the Fourier Transform must be calculated \( rw_j \) times and the inverse Fourier Transform another \( rw_j \) times. We use the Fast Fourier Transform (FFT) to implement both. An FFT of length n requires \( n\log_2(n) \) real multiplications and additions.

  • The Computing Overall pmf step needs n iterations to calculate \( \zeta_i(x) \).

  • In the SWBP step, to find the best half window size, a search strategy [5] was proposed that only searches half window sizes from \( \eta \in \left\{1^2, \dots, {\left\lfloor \sqrt{\frac{n-1}{2}}\right\rfloor}^2\right\} \). Although this strategy remarkably reduces the number of search iterations, we found in our experiments that it may miss the best half window size. Therefore, we evaluate the expected rates for all \( \eta \in \left\{1, 2, \dots, \left\lfloor \frac{n-1}{2}\right\rfloor\right\} \), which needs \( \left\lfloor \frac{n-1}{2}\right\rfloor \) iterations. In practice, since γ(1) is always large enough, it is unnecessary to calculate it for η = 1. In each SWBP iteration, computing \( t_i(\eta) \) needs n iterations in (14), and computing γ(η) needs another n iterations in (15), in which the ln function is very time-consuming.

Based on the above analysis, we conclude that the Syndrome-to-Source BP step and the SWBP step are the two major time-consuming parts of the entire algorithm. To verify this conclusion, we use the profile function in MATLAB to investigate the timing details. The profile function [10] records the execution time of each function in MATLAB code. The timing result is shown in Fig. 3, where qldpc_test is the main function, which costs 5.548 s; ntt is the FFT function, which is called 8192 times and costs 3.912 s in total; and swbp_lap is the SWBP function, which is called 7 times and costs 1.607 s in total. The profile analysis shows that the FFT and SWBP are the two major bottlenecks, which agrees well with our analysis. Since a GPU has a large number of cores that can run many threads simultaneously, we exploit this feature to accelerate the above bottlenecks. In our parallel algorithm, each bottleneck is divided into many pieces that have no correlation with each other, and each piece runs on one core of the GPU. The details of the parallel algorithm are introduced in the next two subsections.

Fig. 3

Timing details of the sequential Q-ary SWBP
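The profiler invocation itself is standard MATLAB, shown here for completeness (qldpc_test is the main routine named above):

    profile on       % start collecting per-function timing statistics
    qldpc_test;      % run the sequential decoder under test
    profile viewer   % open the report summarized in Fig. 3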

3.2 Parallel syndrome-to-source BP algorithm

The sequential Syndrome-to-Source BP algorithm needs m iterations to calculate \( q_{j,i}(x) \), as there are m syndrome nodes. Since these m iterations are independent of each other, they can be executed in parallel. Take Fig. 1(b) as an example. Figure 4 depicts our parallel algorithm, where the numbers in the squares are the row positions of the non-zero elements of the parity-check matrix H in Fig. 1(b). Five threads run simultaneously on the GPU, one per \( q_{j,i}(x) \), and each thread calculates its FFTs in sequence.

Fig. 4

Parallel Syndrome-to-Source BP algorithm
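On the GPU, the m independent updates can also be batched rather than looped. The sketch below (our illustration for a regular code with row weight dc) stacks all edge messages into one array and issues a single batched transform; we use MATLAB's fft, which is gpuArray-enabled, purely as a stand-in for the transform F of (8) — the paper's own GF(2^q) transform is implemented by its ntt routine.

    % R is Q-by-(m*dc): permuted edge messages, grouped by syndrome node.
    Rg  = reshape(gpuArray(R), Q, dc, m);   % one Q-by-dc slab per node
    Fg  = fft(Rg);                          % all forward transforms at once
    psi = prod(Fg, 2);                      % psi_j(w) for every node j
    % Leave-one-out inverses (zeros in Fg must be guarded in practice).
    Qg  = real(ifft(bsxfun(@rdivide, psi, Fg)));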

3.3 Parallel SWBP algorithm

In the sequential SWBP algorithm, each window-size iteration generates an expected rate γ(η), calculated by (15). Any two expected rates γ(η1) and γ(η2) (η1 ≠ η2) are uncorrelated and can be computed in parallel. In our parallel algorithm, all γ(η) for \( \eta \in \left\{1, 2, \dots, \left\lfloor \frac{n-1}{2}\right\rfloor\right\} \) are calculated simultaneously by thousands of threads on the GPU. Once all γ(η) are obtained, we use the min() function in MATLAB to get the smallest γ and the corresponding best η from the array of γ(η). The sequential and parallel algorithms are illustrated in Fig. 5.

Fig. 5.

(a) sequential SWBP algorithm, (b) parallel SWBP algorithm.
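A vectorized realization of the parallel search (our sketch, not the paper's code) evaluates (12)–(16) for every candidate η at once via prefix sums; mu and y are length-n gpuArray columns, and LQ is the helper sketched in Section 2.1. Implicit expansion (MATLAB R2016b+) is assumed; on older releases such as the paper's 2014b, bsxfun is required.

    n    = numel(mu);
    etas = 2:floor((n-1)/2);         % gamma(1) is skipped, cf. Section 3.1
    c    = cumsum([0; mu]);          % prefix sums: c(k+1) = mu(1)+...+mu(k)
    i    = gpuArray((1:n)');         % symbol indices (column)
    e    = gpuArray(etas);           % candidate half window sizes (row)
    lo   = max(1, i - e);            % n-by-#etas window left edges
    hi   = min(n, i + e);            % n-by-#etas window right edges
    t    = c(hi+1) - c(lo);          % t_i(eta) for every (i, eta), as in (13)
    b    = (t - mu) ./ (hi - lo);    % local scale estimates (12)
    gam  = sum(log(LQ(b, y, Q)) + mu ./ b, 1);   % expected rates (15)
    [~, k]   = min(gather(gam));     % (16): pick the minimizer
    eta_best = etas(k);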

3.4 Vectorization

Thanks to the features of the MATLAB language, we can manipulate matrices and vectors with vectorized code instead of loop-based code. Take the calculation of (11) as an example. Listing 1 shows our MATLAB implementation of (11) in loop-based and vectorized form, respectively. From this code, we find that vectorization has three advantages: less code means fewer errors; it is easier to understand, since vectorized code resembles the mathematical expressions in the equations; and it runs much faster than loop-based code, since MATLAB is optimized for vectorization. With vectorization, the code is more concise and elegant: our vectorized source code is 10% shorter than the loop-based code.


Listing 1 MATLAB code of (11)
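Since the listing appears above only as an image, the following sketch conveys its substance (our reconstruction of (11), not the verbatim Listing 1; zeta is Q-by-n with rows indexed by x = 0..Q-1, and y is 1-by-n):

    % Loop-based code: two nested loops over symbols and alphabet values
    mu = zeros(1, n);
    for i = 1:n
        for x = 0:Q-1
            mu(i) = mu(i) + zeta(x+1, i) * abs(x - y(i));
        end
    end

    % Vectorized code: one expression replaces both loops
    mu = sum(zeta .* abs(bsxfun(@minus, (0:Q-1)', y)), 1);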

4 Video compression by QLDPC

Under the DVC framework, the encoder is implemented by the Pixel-Domain and Transform-Domain (PDTD) scheme. PDTD divides video frames into Key Frames, which are encoded and decoded using a conventional intraframe codec, and Wyner-Ziv Frames. A block-wise DCT is performed on the Wyner-Ziv frames to obtain the transform coefficients XDCT, which are independently quantized and grouped into coefficient bands. These bands are compressed by the LDPC encoder. At the decoder, the correlation between XDCT and the side information SDCT is modeled as a Laplacian distribution. The LDPC decoder reconstructs the coefficient bands with the corresponding side information and performs an inverse DCT to generate the reconstructed Wyner-Ziv frames.

One frame of the video is normally represented as an n × n 2D source. Let \( x^{n,n} \) and \( y^{n,n} \) be two n × n 2D sources, where

$$ {x}^{n,n}\triangleq \left(\begin{array}{ccc}{x}_{0,0}& \cdots & {x}_{0,n-1}\\ {}\vdots & \ddots & \vdots \\ {}{x}_{n-1,0}& \cdots & {x}_{n-1,n-1}\end{array}\right) $$
(17)

where xi, j ∈ [0 : Q). The correlation between them follows the setup in Section 2.1.

At the encoder, \( x^{n,n} \) is first vectorized into a Q-ary temporary vector \( v^{n,n} \). Then a matrix-vector multiplication over the finite field GF(Q) compresses the source into the syndrome \( s^m \):

$$ {s}^m=\boldsymbol{H}{v}^{n,n}. $$
(18)

where H is an m × (n × n) sparse parity-check matrix.

At the decoder, the QSWBP is performed with the help of syndrome and side information to recover the source.
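Putting (17)–(18) together, the per-frame encoding step reduces to a few lines (our sketch, reusing the gf idiom from Section 2.2; frame is an n-by-n integer block and Hint the integer parity-check matrix, both our hypothetical names):

    v = double(frame(:));          % vectorize the n-by-n source, as in (17)
    s = gf(Hint, q) * gf(v, q);    % syndrome s^m over GF(Q), as in (18)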

5 Experiment results

In our experiments, we use an Intel Core i7 at 3.60 GHz as our CPU and an NVIDIA GTX 1080 Ti as our GPU. The detailed parameters of this GPU are listed in Table 1. We used MATLAB 2014b as our development platform, since this version provides full support for GPU acceleration.

Table 1 Parameters of GPU platform

5.1 Performance of parallel QSWBP

To evaluate the performance of our parallel algorithm, we perform two experiments with different Q.

In the first experiment, we set Q = 256 and use 4 different regular LDPC codes as input. The parameters of these LDPC codes are listed in Table 2. To eliminate random errors, we perform 100 tests and average the outputs as our final results. The experimental result is illustrated in Fig. 6, which shows that the parallel QSWBP algorithm achieves a 2.9× to 30.3× speedup over the sequential QSWBP algorithm. The longer the codeword, the higher the speedup ratio.

Table 2 Different LDPC code parameters (N is the codeword length, K is the information bit number)
Fig. 6

Running time and speedup ratio under Q = 256

The second experiment uses the same LDPC codes as the first, but with Q = 2048. The experimental result is illustrated in Fig. 7, which shows that the parallel QSWBP algorithm achieves a 2.3× to 11.4× speedup over the sequential QSWBP algorithm. The trend of the speedup ratio is the same as in the first experiment.

Fig. 7

Running time and speedup ratio under Q = 2048

Normally, successful decoding needs 11 rounds of BP iterations under Q = 256, but only 3 rounds under Q = 2048. As a result, the total running time of the second experiment is shorter than that of the first. Because the speedup accumulates over BP rounds, the first experiment achieves a higher speedup than the second.

5.2 Performance of video decoding

We borrow the YUV video sequence Foreman from [16] and choose its first 4 frames as the source. Each frame is cropped to 128×128 pixels. We construct a regular LDPC code with codeword length 16,384 and information bit number 8192, so the rate is 1/2. The alphabet cardinality Q is fixed to 2^8 = 256.

Our earlier experiments demonstrated that QSWBP outperforms QBP in video decoding [14]. In this paper, only the parallel and sequential QSWBP are evaluated, by decoding the Foreman video. The running time is used as the metric to compare the two algorithms. To eliminate random errors, we perform 100 decoding processes and average the running times as our final results. The experimental results are listed in Table 3. Both the parallel and sequential QSWBP obtain the same recovered frames, which are displayed in Fig. 8.

Table 3 Performance Comparison Between Parallel and Sequential QSWBP Algorithms
Fig. 8

The 4 recovered frames of Foreman

From Table 3, we find that the parallel QSWBP achieves a 69.21× to 78.31× speedup across the frames. This is because, with a very long LDPC code (length 16,384), the sequential QSWBP needs many iterations to search for the best window size.

6 Conclusion

A parallel Q-ary SWBP algorithm for decoding regular QLDPC codes with different code lengths and Q values has been proposed. We accelerate this algorithm with the GPU and MATLAB vectorization techniques. Experimental results show that the parallel algorithm achieves a 2.9× to 30.3× speedup under Q = 256 and a 2.3× to 11.4× speedup under Q = 2048. The video decoding experiment shows that the parallel algorithm achieves a 69.21× to 78.31× speedup under Q = 256. In future work, we will implement the parallel algorithm on an FPGA platform to extend its applications.