1.1 Introduction

Low-density parity-check (LDPC) codes are well suited for error-correction applications. However, the challenge is to find strategies that enable efficient implementations while ensuring good performance. Iterative decoder designs with limited-precision quantization, suitable for digital logic implementation, appear in the works of T. Zhang and Parhi [1], Planjery et al. [2], and Z. Zhang et al. [3].

In [4] we presented two 3-bit quantizations for a sum-product algorithm LDPC decoder on a Gaussian channel. This chapter expands upon the decoder design and refines the synthesis results. In our examinations of many 3-bit quantizations, our best choice of quantization changes as the channel conditions change. We propose an adaptive design that changes between our two selected quantizations based upon the channel condition.

Our experiments are with a rate-1/2, length-1162 binary LDPC code; it is from a family of codes that our research group has generated using permutation matrices [5, 6]. The cyclic permutation structure is known to have efficient hardware implementations [7–9].

1.2 Scope

The sum-product algorithm (SPA) was simulated on a computer cluster for 10 iterations, using look-up tables based upon 3-bit quantization. We determine the per-iteration computational latency and evaluate the trade-offs between iteration count and per-iteration computation, both of which contribute to total latency and total decoding gain. Our quantization, with 10 iterations, surpasses the performance of the decoder by Planjery et al. with 100 iterations. Gain versus latency is our comparison criterion, although we discuss other potential criteria; in an engineering application, the designer could instead attempt to maximize throughput or minimize power consumption.

We are particularly motivated to achieve low-latency decoding in the waterfall region. For voice communication and video streaming, a partial packet (with one or more uncorrected errors) is preferable to an entirely lost packet, and a correctly re-transmitted but excessively delayed out-of-order packet is useless. Moderate error rates (10−3–10−5 BER) at the decoder output are acceptable for these applications.

1.2.1 Circulant Permutation Matrix

The experiments are performed with a quasi-cyclic LDPC code constructed using a method that yields permutation-based parity-check matrices with large girth [5]; this particular graph is constructed with a girth of 10. Here σ is an 83-by-83 circulant permutation matrix of the form

$$ \sigma = \left[ \begin{array}{ccccc} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & & 0 \\ \vdots & & & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 0 \end{array} \right]. $$

We can create circular shifts: σ^κ is the original submatrix circularly shifted by κ − 1 positions, so σ^0 has the form of an identity matrix. H is a block parity-check matrix comprised of circulant permutations. The code, C, is the set of vectors in the null space of H. The 1162-bit length of C is comparable to the lengths of LDPC codes in other research.

$$ H = \left[ \begin{array}{cccccccccccccc}
\sigma^{0} & \sigma^{0} & \sigma^{0} & 0 & 0 & 0 & 0 & \sigma^{0} & \sigma^{0} & \sigma^{0} & 0 & 0 & 0 & 0 \\
\sigma^{0} & 0 & 0 & \sigma^{0} & \sigma^{0} & 0 & 0 & \sigma^{1} & 0 & 0 & \sigma^{0} & \sigma^{0} & 0 & 0 \\
\sigma^{0} & 0 & 0 & 0 & 0 & \sigma^{0} & \sigma^{0} & 0 & 0 & \sigma^{2} & \sigma^{3} & 0 & \sigma^{0} & 0 \\
0 & \sigma^{0} & 0 & \sigma^{2} & 0 & \sigma^{4} & 0 & 0 & \sigma^{9} & 0 & 0 & \sigma^{13} & \sigma^{16} & 0 \\
0 & \sigma^{0} & 0 & 0 & \sigma^{5} & 0 & \sigma^{1} & \sigma^{19} & 0 & 0 & 0 & 0 & \sigma^{11} & \sigma^{0} \\
0 & 0 & \sigma^{0} & 0 & \sigma^{6} & \sigma^{13} & 0 & 0 & \sigma^{32} & 0 & \sigma^{40} & 0 & 0 & \sigma^{30} \\
0 & 0 & \sigma^{0} & \sigma^{7} & 0 & 0 & \sigma^{17} & 0 & 0 & \sigma^{26} & 0 & \sigma^{49} & 0 & \sigma^{53}
\end{array} \right] $$
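As an aside, this block structure is straightforward to generate programmatically. The following Python sketch (our own illustration; the chapter defines no code) builds σ^κ as a circularly shifted identity matrix and confirms that matrix multiplication matches exponent addition:

```python
def sigma_power(k, n=83):
    """Return sigma^k as a 0/1 row-list: row r has its 1 in column (r + k) mod n."""
    return [[1 if c == (r + k) % n else 0 for c in range(n)] for r in range(n)]

def matmul(a, b):
    """Plain matrix product; products of permutation matrices stay 0/1."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)] for r in range(n)]

# sigma^0 is the identity, and sigma^1 * sigma^2 = sigma^3
identity = [[1 if c == r else 0 for c in range(83)] for r in range(83)]
assert sigma_power(0) == identity
assert matmul(sigma_power(1), sigma_power(2)) == sigma_power(3)
```

A full H could then be assembled by placing these blocks according to the array above, with 0 denoting the 83-by-83 all-zero block.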

1.2.2 FPGA Implementation

Because the field programmable gate array (FPGA) offers a very rapid pathway to concept verification, it fostered our exploration of the trade-offs between precision and computational speed. The application specific integrated circuit (ASIC) also offers customized precision and designer-defined data types that are not available in microprocessors, but at a high development cost. The FPGA synthesis results in this chapter serve as an indicator of hardware complexity, size, and speed trade-offs; we anticipate that the comparisons of our designs in the FPGA domain would translate to proportional advantages in the ASIC domain. An FPGA solution [1] in the literature achieved LDPC decoding using operands with just 5 bits. Our own prior research [10] explored trade-offs between the number of bits of precision and the number of decoding iterations. The regular LDPC decoder has a very repetitive structure; for our (6, 3)-regular code, each variable node outputs three update messages. We implemented the logic of one output message, determined the latency, and then implemented all three outputs in order to observe the consequent speed and size.

The Altera DE2 development board, selected for this work, was provided by the Altera Corporation through a university research grant. The FPGA on the DE2 board is the Cyclone II EP2C35F672C6N with 33,216 programmable logic elements.

1.2.3 Formulation of the Iterative Algorithm

Our quantization is applied to the computationally efficient SPA formulation of [11]. Figure 1.1 shows the three formulations of the SPA [12–15] that we analyzed in the ISIT 2006 paper. Each is illustrated cycling through probability representations, where the bit-to-check (μ) messages and the check-to-bit (ν) messages can be expressed in terms of probabilities, differences δp = P(0) − P(1), ratios ρp = P(0)/P(1), or log-likelihood ratios λp = log ρp. The δp representation transforms [0, 1] probability values to the range [−1, +1].
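The conversions among these representations are simple enough to state concretely. A minimal Python sketch (ours, not from [11]) of the definitions above, including the identity δp = tanh(λp/2) that links the difference and LLR domains:

```python
import math

def to_delta(p0):
    return 2 * p0 - 1            # delta_p = P(0) - P(1), in [-1, +1]

def to_rho(p0):
    return p0 / (1 - p0)         # rho_p = P(0)/P(1)

def to_llr(p0):
    return math.log(to_rho(p0))  # lambda_p = log(rho_p)

# the difference and LLR representations are linked by delta = tanh(lambda/2)
p0 = 0.8
assert abs(to_delta(p0) - math.tanh(to_llr(p0) / 2)) < 1e-12
```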

Fig. 1.1
figure 1

Iterative formulations in the literature

Our two formulations, shown in Fig. 1.2, which represent probabilities as differences (δp) or as log-likelihood ratios (LLR), offered significant computational advantages by requiring fewer processor instructions [11]. Transforming multiplication operations into addition operations in the log domain also increases performance on computer processors with arithmetic logic units that can perform addition more rapidly than multiplication [16–20]. These differences diminish when only a few bits of precision are in use.

Fig. 1.2
figure 2

Our proposed formulations of the sum-product algorithm

As Han and Sunwoo showed [21], the LLR calculations involve one particularly obstructive computation, an inverse hyperbolic tangent function; their limited-precision computation involves a lookup table for this calculation. Z. Zhang et al. have also looked at fixed-point LLR quantizations using 5, 6 and 7 bits [3]. In these implementations, the hyperbolic tangent function is a substantial part of the design effort and computational work. The algorithm formulations that we devised do not contain a hyperbolic tangent calculation.

In this chapter, instead of looking at the parity check and variable-node update as two separate actions, we present the cycle as a single computation with one quantization applied per iteration.

1.2.4 Comparing BSC and AWGN

This chapter compares decoding results on an additive white Gaussian noise (AWGN) channel with competing published results that use the binary symmetric channel (BSC). The BSC bit-crossover probability, α, can be determined from the Gaussian signal-to-noise ratio (SNR), E_b/N_0, by

$$ \alpha = \tfrac{1}{2}\,\mathrm{erfc}\!\left(\sqrt{2E_{b}/N_{0}}\right). $$
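As a sketch (ours), the conversion reads directly in Python; Q-function versus erfc conventions vary across authors, so we simply transcribe the expression above:

```python
import math

def bsc_alpha(ebno_db):
    """BSC crossover probability from Eb/N0 in dB, per the equation above."""
    ebno = 10 ** (ebno_db / 10)                  # dB -> linear ratio
    return 0.5 * math.erfc(math.sqrt(2 * ebno))

# the crossover probability shrinks as the channel improves
assert 0 < bsc_alpha(3.0) < bsc_alpha(1.0) < 0.5
```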

For decoders with floating-point belief propagation, there is an almost 2 dB difference in performance between the two channels. As Fig. 1.3 shows, the difference is about the same for bit error rate (BER) and frame error rate (FER). These serve as reference curves; we have 2-bit limited-precision sigmoid-based quantizations that approach the BSC curve, and the 3-bit quantizations presented in this chapter surpass it. Considering the 2 dB loss, it seems appropriate to deliver soft decisions to the decoder whenever the receiver has soft information available. Our decoder design assumes a soft-decision receiver.

Fig. 1.3
figure 3

Decoding difference between AWGN and BSC channels, using C

1.3 Planjery’s Beyond Belief Propagation

Planjery et al. devised 2-bit and 3-bit quantized decoding designs for the decoding of LDPC codes on the BSC. These algorithms begin with a single-bit quantization (a hard decision) at the receiver. Another quantization occurs at each parity check, and messages are quantized at each variable-node update. Other algorithms in the literature quantize in a similar two-quantizations-per-iteration fashion, as illustrated in Fig. 1.4.

Fig. 1.4
figure 4

Quantization of the variable nodes and the parity computation

We replicated the quantized 3-bit algorithm specified in Planjery’s paper [2]. To verify our implementation, we replicated their decoding results (100 iterations) using the published codes (benchmarks) that they used for testing. We then ran simulations upon C, with both 10 and 100 iterations. The results are the two upper curves of Figs. 1.5 (BER) and 1.6 (FER).

Fig. 1.5
figure 5

BER for the published and proprietary decoders of Planjery et al., using C

Fig. 1.6
figure 6

FER for the published and proprietary decoders of Planjery et al., using C

Planjery also produced improved results using a specialized, proprietary 3-bit quantization and algorithm designed to overcome the influence of trapping sets. With Shiva Planjery’s gracious cooperation we obtained the performance curve of their proprietary decoder applied to the LDPC code from our own permutation construction. Transformed from their crossover-probability axis to our SNR axis, this curve is shown in Figs. 1.5 (BER) and 1.6 (FER) and repeated in Figs. 1.10 and 1.11 as a comparison for our quantizations.

1.3.1 Synthesis of the Planjery-Vasic 3-bit Decoder

We implemented the published 3-bit logic in Verilog HDL. The synthesis results, targeting our Cyclone II FPGA, were reported by the Altera Quartus II software, giving a baseline for the cost of their published algorithm. The single-bit computation used 138 logic elements and had a longest path delay of 20.489 ns. If we were to compute 1162 bits (the length of our LDPC code) simultaneously, the footprint would expand to 160,356 logic elements. If we were to compute, sequentially, the 100 iterations used in Planjery and Vasic’s simulations, the decoding latency would multiply to 2.0489 microseconds. We programmed this design into the DE2 board for verification and demonstration.
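The scaled figures quoted above follow directly from the per-bit synthesis numbers; a quick restatement of the arithmetic:

```python
# per-bit synthesis results for the published Planjery 3-bit logic
le_per_bit = 138       # logic elements for one bit's computation
lpd_ns = 20.489        # longest path delay, nanoseconds

full_parallel_le = 1162 * le_per_bit   # all 1162 code bits at once
latency_100_iter_ns = 100 * lpd_ns     # 100 sequential iterations

assert full_parallel_le == 160356
assert abs(latency_100_iter_ns - 2048.9) < 1e-6   # i.e., 2.0489 microseconds
```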

Their second-stage proprietary rule accounts for about 1.5 dB of additional decoding gain, and it increases the implementation logic and latency by an amount unknown to us. The quantizations that we propose in the following sections require more logic elements than their baseline, but our performance results show a great return on the additional logic.

1.4 Our Quantization Work

This section explains how and where quantization is applied within the algorithm, what quantizations we chose to use, and the results that we obtained. We start from the δμ SPA formulation we proposed in [11].

1.4.1 One Computation per Iteration

The SPA is typically described as two computational steps. We treat the iteration as one combined step instead of two separate steps; quantization is applied once rather than twice per iteration. The intermediate parity-check values are indirectly quantized, but not explicitly by the design. Figure 1.7 illustrates the whole-iteration computation.
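A hedged floating-point sketch of this combined step (our own reconstruction, not the chapter's lookup table): in the δ representation a parity check multiplies its incoming messages, the variable node sums in the LLR domain, and quantization, represented here by a caller-supplied function, is applied exactly once:

```python
import math

def combined_update(channel_delta, check_inputs_a, check_inputs_b, quantize):
    """One whole-iteration update for a degree-3 variable node (sketch)."""
    nu_a = math.prod(check_inputs_a)   # check node a: product of deltas
    nu_b = math.prod(check_inputs_b)   # check node b: product of deltas
    clip = lambda d: max(-0.999999, min(0.999999, d))  # keep atanh finite
    lam = sum(2 * math.atanh(clip(d)) for d in (channel_delta, nu_a, nu_b))
    return quantize(math.tanh(lam / 2))  # single quantization per iteration

# with an identity "quantizer", agreeing evidence strengthens the belief
out = combined_update(0.5, [0.5, 0.5], [0.5, 0.5], lambda d: d)
assert out > 0.5
```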

Fig. 1.7
figure 7

Quantization of the variable nodes. One quantization per iteration

1.4.2 Quantization Scales

Our formulation and quantization values are expressed in δμ representation. Several 5-bit quantizations proved to be very effective in LDPC decoding in our previous effort [10]. Among the quantization schemes tested was one using the sigmoid function, S(x) = 1/(1 + e^−x). In [4] we presented two related 3-bit sigmoid-based quantizations, using sigmoid evaluations at evenly spaced points to determine the discrete scale values: S(x) for x = ±1.5n = ±{1.5, 3.0, 4.5, 6.0} and for x = ±2.0n = ±{2.0, 4.0, 6.0, 8.0}. These show particular promise for decoder quantization over a tested range of Gaussian-channel SNR values.

The chosen step thresholds are the means between the step heights. The step-function mapping of δp assigns the quantized value s_i, choosing i such that t_{i−1} ≤ δp ≤ t_i. The two tested quantization scales are titled the “635” sigmoid-based quantization, illustrated in Fig. 1.8a, and the “762” sigmoid-based quantization, illustrated in Fig. 1.8b. Step and threshold values for both quantizations are given in Tables 1.1 and 1.2.

Fig. 1.8
figure 8

Sigmoid-based quantizations (a) “635” and (b) “762”

Table 1.1 Step values “S” (δμ) for the “635” and “762” quantizations
Table 1.2 Threshold values “T” for “635” and “762” quantizations

Notice how, for both scales, the precision is concentrated in the regions of greatest certainty; the step functions have finely spaced steps at the two extremes. This family of quantizations suggests an implementation strategy for varying the decoder precision; such a strategy could compete with other adaptive error-correction technologies that have been developed (rate-compatible codes, etc.). The two tested quantizations differ only in how the x values of the sigmoid S(x) are selected; their similarities might simplify the implementation of an adaptive design offering both quantizations.
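The construction lends itself to a compact sketch. In the δ representation 2S(x) − 1 = tanh(x/2), so the first positive steps come out near 0.635 (x = 1.5) and 0.762 (x = 2.0), which appears to explain the scale names; the published Tables 1.1 and 1.2 remain authoritative, and the values below are our own reconstruction:

```python
import math

def steps(spacing):
    """Eight 3-bit step heights in the delta domain: 2*S(x) - 1 = tanh(x/2)."""
    pos = [math.tanh(spacing * n / 2) for n in (1, 2, 3, 4)]
    return [-s for s in reversed(pos)] + pos

def quantize(delta, s):
    """Map delta to the step whose threshold interval contains it."""
    thresholds = [(a + b) / 2 for a, b in zip(s, s[1:])]  # means of step heights
    i = sum(delta > t for t in thresholds)                # count thresholds passed
    return s[i]

s635, s762 = steps(1.5), steps(2.0)
assert abs(s635[4] - 0.635) < 1e-3 and abs(s762[4] - 0.762) < 1e-3
assert quantize(0.97, s635) == s635[6]   # fine steps near the extremes
```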

1.4.3 Decoder Performance

One of our quantization scales was better for low SNR conditions and the other was better for high SNR conditions, as SPA simulation results show in Figs. 1.9 (BER) and 1.10 (FER). Our quantized designs were tested with 10 iterations; increasing to 100 iterations resulted in only minor additional gains (1/4 dB BER, 1/3 dB FER). The graphs show comparable results from a simulation by Planjery, using their proprietary decoder upon our code C.

Fig. 1.9
figure 9

BER for our sigmoid-based “635” and “762” quantizations, using C

Fig. 1.10
figure 10

FER for our sigmoid-based “635” and “762” quantizations, using C

The small thin vertical bars on the graphs show the upper end of a 95 % confidence interval for each of our simulation result values. These confidence intervals can be reduced with longer simulations (more samples). The confidence intervals that we present are small enough to assert firmly that “762” outperforms “635” over the [4.0, 5.0] SNR range and that “635” outperforms “762” over the [1.0, 3.5] SNR range.

The BER performance is about 0.9 dB better than that of the Planjery and Vasic proprietary algorithm over a substantial range. Somewhat smaller FER gains, around 0.5 dB, are also seen over most of the tested SNR region. A design adapting between our two quantizations outperforms their approach over the entire tested range.

1.4.4 Synthesis Results

In our quantization approach, as described above, limited precision is applied to the receiver sampling and to the variable-node updates. Using this, we implemented a combined parity-check and variable-node-update calculation using a mixture of arithmetic, logic, and a lookup table. The 3-bit inputs into each (6, 3) parity check yield one of 112 possible values (far fewer than the 2^(3×5) = 32,768 apparent input combinations). Another way to express this is as an imputed quantization: the parity-check output requires no more than seven bits, since 112 < 2^7. Two parity checks and the original sample factor into the update calculation, specified as a 112 × 112 × 8 lookup table. Additional symmetries make it unnecessary to implement this complete table. Our technique for finding the simplifications was to allow the Altera Quartus II synthesis tool to do the simplifying for us. For our tested quantizations, the tool consistently digested the lookup table (specified in Verilog HDL) and produced a result with a complexity reduced by a factor of about 1000. The cost for each effort was an overnight (8.5 h) run of the Quartus II synthesis, place, and route tool.
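The count of 112 can be verified combinatorially: in the δ domain the check output is a signed product of five magnitudes drawn from four levels, giving C(8, 3) = 56 magnitude multisets, each realizable with either sign. The sketch below substitutes primes for the four step magnitudes (our device; unique factorization guarantees distinct products, as generic step values would):

```python
from itertools import product

# five incoming 3-bit messages; primes stand in for the four step magnitudes
msgs = [s * m for s in (1, -1) for m in (2, 3, 5, 7)]
outputs = {a * b * c * d * e for a, b, c, d, e in product(msgs, repeat=5)}
assert len(outputs) == 112   # matches the chapter's count, and 112 < 2**7
```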

The tool reports the number of logic elements (LE) required for the design and computes, after optimal placement and routing, the longest path delay (LPD) between any pair among the inputs and outputs. The inverse of the LPD is the highest appropriate clock frequency for the logic in a clock-synchronous design. The synthesis results for our two quantizations are reported in Table 1.3.

Table 1.3 Synthesis results for each quantization

Calculating one variable update using two associated parity checks synthesized to fewer than 5,000 logic elements. When the design was expanded to include all three associated parity checks and compute all three of the resulting variable-node updates, the footprint more than doubled but did not triple, and the delay increased by less than 20 %. We can deduce that the three-message logic synthesized to a blend of shared computation and parallelism.

The chosen Cyclone II FPGA is too small for the 1162 replications of this design needed to handle all of the bits of a code word simultaneously. With limited parallelization [7] or serial implementation [9] a complete FPGA-based decoder is still entirely feasible.

The LPD figures include some amount of input/output (I/O) delay that is characteristic of the FPGA. Since a multiple-iteration decoding operation might omit I/O between iterations, we sought to isolate this contribution. Building one simple model with a single exclusive-or (XOR) gate and another with a cascade of two XOR gates, we determined from an extrapolation of the two designs’ LPD values that the I/O contribution is 11.561 ns. Adjusted LPD figures are shown in the rightmost column of Table 1.3.
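The extrapolation amounts to solving two linear equations: the single-XOR design measures io + d and the cascade measures io + 2d, hence io = 2·LPD₁ − LPD₂. A sketch with hypothetical placeholder measurements (the chapter reports only the resulting 11.561 ns):

```python
def io_delay(lpd_one_xor_ns, lpd_two_xor_ns):
    """I/O contribution extrapolated from two LPD measurements: io + d and io + 2d."""
    return 2 * lpd_one_xor_ns - lpd_two_xor_ns

# hypothetical example measurements, not the chapter's data
assert abs(io_delay(12.0, 12.4) - 11.6) < 1e-9
```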

1.5 Comparing Decoders

Our synthesized designs have three to four times the adjusted per-iteration latency of Planjery’s published design (per our implementation of their design and our synthesis results). Since our decoder exceeds, in 10 iterations, the decoding gain of Planjery’s proprietary decoder with 100 iterations, we compute the total decoding time for one bit to be 10 × 31.538 = 315.38 ns for our design and 100 × 8.928 = 892.8 ns for Planjery’s published design. The timing advantage of our decoder, having accounted for a worst-case FPGA I/O contribution, is about 65 %.
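Restating that comparison as arithmetic:

```python
ours_ns = 10 * 31.538       # 10 iterations at our adjusted per-iteration LPD
theirs_ns = 100 * 8.928     # 100 iterations at their adjusted LPD

advantage = 1 - ours_ns / theirs_ns
assert abs(ours_ns - 315.38) < 1e-6
assert 0.64 < advantage < 0.66   # roughly the 65 % figure quoted above
```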

The logic circuitry of our decoder, with its quantizations, was larger than the logic to implement their decoder, but our decoding operation was faster and obtains better decoding results for the tested ranges of SNR, BER and FER. Our computation for one code symbol fits within the selected FPGA; and we could readily use this chip to decode a full codeword in a serial fashion. Alternatively, we could increase throughput by using a larger chip or by adapting this design for an ASIC. Using a larger chip would give us greater throughput and parallelization opportunities; these can be explored more thoroughly under the engineering constraints of a specific application.

Our results, using 3-bit samples from a Gaussian channel, show 0.5–0.9 dB better gain than the hard-decision receiver approach used by Planjery et al. [2]. A conclusion from this is that a receiver that samples incoming symbols with three bits is better than one that makes a hard decision; the fidelity available at the receiver sampling point should not be discarded. The quantization selected for 3 bits of precision does make a difference, and considering the channel conditions is necessary when choosing the best possible quantization. Because one of our quantizations was better in the lower SNR range and the other was better in the higher SNR range, we propose a decoder that adapts between the two quantizations as the channel conditions vary; the current noise level could be estimated from the sample variance. The 33,216 LE capacity of our FPGA could accommodate the logic of both quantizations with enough room remaining for the logic to measure the channel SNR and select the quantization adaptively. The adaptive decoder, illustrated in Fig. 1.11, surpasses Planjery’s decoder on the AWGN channel by approximately 0.9 dB over a substantial waterfall range (BER 10−2–10−7).

Fig. 1.11
figure 11

Adaptive decoder that uses the “635” and “762” sigmoid quantizations

Although the single-iteration latency is greater than that of the Planjery et al. design, our decoding success with 10 iterations means that a better solution, over a range of SNR conditions, can be reached in less time. We believe there is potential for parallelization and pipelining, but even working through the bits serially, one at a time, the 430 ns per-bit processing would support over 2 Mbps of decoding throughput. FPGA-based signal-processing solutions are of interest for software-defined radio (SDR) applications, which require reconfigurability [22]. The FPGA-based decoding capability we propose is adequate to fulfill the diverse narrowband requirements of one particular contemporary system and achieves the lower throughput threshold for wideband operations [23].
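The throughput claim follows directly from the per-bit figure:

```python
per_bit_ns = 430                    # serial processing time per code bit
throughput_mbps = 1e3 / per_bit_ns  # (1e9 ns/s) / 430 ns, scaled to Mbit/s
assert 2.0 < throughput_mbps < 2.5  # over 2 Mbps, as claimed
```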

Our synthesis is of Planjery’s published design. Two assumptions allow us to compare our decoder to their proprietary design: (1) that the proprietary enhancements increase latency and (2) that the proprietary design requires only a modest increase in their resulting logic. With these assumptions, the comparison, summarized in Table 1.4, favors our decoder on two of three evaluation criteria.

Table 1.4 Design comparisons