1 Introduction

Low-Density Parity-Check (LDPC) codes and turbo codes are used in a wide range of applications, like mobile, wireless and wired communication, deep space and satellite links, TV broadcasting, magnetic storage, and flash memories. Quite a large number of hardware implementations are available in the open literature for turbo and LDPC decoders, and most of them are based on parallel architectures, made of several interconnected Processing Elements (PEs). The capabilities of these decoders in terms of supported codes strongly depend on the structure of the interconnect network among the PEs. Simple and efficient structures are used to support families of homogeneous codes, like the ones included in a single or limited number of communication standards. However, the design of a decoder with larger versatility, able to support both turbo and LDPC codes in a wide set of standards, requires the allocation of more complex and flexible networks. A possible solution toward the implementation of higly flexible channel decoders is given by Networks-on-Chip (NoCs), which were originally devised for general purpose System-On-Chip (SoC), with multiple applications that are mapped on a number of PEs. They potentially guarantee very high flexibility and better scalability with respect to traditional bus based solutions. Lately, the original idea of general purpose NoCs has evolved, and new kinds of NoCs are today proposed with features fully optimized for a single or reduced number of applications (Application-specific NoCs, or ASNoCs, [6]). LDPC and turbo decoders based on very special ASNoCs have been proposed, providing a high degree of flexibility [10, 19, 20, 2729, 32]. These implementations support multiple standards and code types, with dynamic switching capabilities between codes.

In general, ASNoCs proposed for channel decoding are intra-IP NoCs that connect homogeneous PEs; in order to reduce occupied area and dissipated power, they are also characterized by lower complexity and limited capabilities with respect to NoCs usually reported for other applications. For example, it is shown in [27] that the best choices in terms of NoC topology and routing algorithm for a NoC-based turbo decoder are given by Kautz topology, which is less regular than usual 2D-mesh but guarantees shorter delivery time and shortest path algorithm, which can be implemented with a very low complexity. In spite of these choices, the NoC is responsible for a relevant efficiency gap between NoC-based highly flexible decoders and dedicated or partially flexible solutions, such as for example implementations in [33, 35]: the NoC guarantees virtual connectivity among all nodes and great flexibility, but packet latencies and intermediate storage, along with routing logic and memories, increase power consumption and decrease throughput with respect to dedicated and partially flexible decoders.

General power reduction techniques, like clock gating and dynamic voltage scaling, can be applied to channel decoders. Additionally, specific features of LDPC and turbo decoding can be exploited to reduce power dissipation. As an example, since LDPC and turbo decoding are iterative processes, early stopping of iterations criteria have been proposed over the years [21, 25]: these techniques rely on the observation of a metric to decide if it is worth or not to perform additional iterations, avoiding unnecessary energy consumption. The usage of power reduction techniques is in many cases very effective; however, since they can be introduced in any decoding architecture, the power consumption gap between NoC-based decoders and dedicated architectures is still large.

This paper proposes new power reduction techniques for flexible channel decoders. These techniques reduce and optimize the traffic due to messages exchanged among PEs, which account for a significant percentage of the consumed energy. Therefore, the proposed methods are particularly effective with NoC-based decoders. Preliminary contributions in the direction of traffic reduction have been made in [30] for turbo codes and in [7] for LDPC codes, both dealing with the evaluation of the usefulness of exchanged information: this work refines and extends them while proposing new traffic optimization techniques. The performance of the proposed techniques has been extensively evaluated and compared to alternative methods, while the most promising one has been implemented as an application example on a fully characterized decoder [10]. Area overhead and power gain have been obtained and compared to the state of the art to evaluate the effectiveness of the solution. Both multi-standard decoders [1, 16] and single-standard, optimized implementations [11, 34] are taken in account. The comparisons show how the proposed solutions greatly improve the energy efficiency of NoC-decoders and help reduce the gap between flexible and dedicated implementations. For example, the proposed multi-standard LDPC/turbo decoder consumes 99.2 mW, against the 51.6 mW of the decoder in [11], regardless of its support being limited to WiMAX LDPC codes only and being optimized for ultra low power performance.

The rest of the paper is organized as follows: Sect. 2 introduces LDPC and turbo decoding and analyzes the problems arising in parallel implementation. Section 3 proposes different solutions to these problems, while their performance is compared against alternative approaches in Sect. 4. Section 5 describes the hardware architectures of the proposed techniques, and in Sect. 6 the implementation results of the best-performing method are compared to the state of the art. Conclusions are drawn in Sect. 7.

2 NoC-Based Decoding: Practical Issues

LDPC codes [15] are described by a sparse matrix with M rows and N columns. Each word \(x\) that satisfies \(\mathbf H \cdot x = 0\) is considered as valid codeword. LDPC decoding is based on the Tanner graph representation of H, composed of Variable Nodes (VNs, the columns of H) and Check Nodes (CNs, the rows of H). The Belief Propagation (BP) algorithm is the most common algorithm for LDPC decoding, especially with the efficient layered scheduling [17]. In a layered decoder, parity-check constraints are grouped in layers of unconnected H rows: extrinsic information is passed from one layer to a subsequent one [17], reiterating the process up to the desired level of reliability. Let the Logarithmic Likelihood Ratio (LLR) of bit \(c\) in column \(k\) of H be represented with \(\lambda _k[c]\), and initialized to the corresponding received soft value. For all parity constraints \(l\) in a given layer, VN \(k\) receive the extrinsic information \(\lambda ^\mathrm{old}_k[c]\) from the previous layer and produce an updated version \(\lambda ^\mathrm{new}_k[c]\), sent to the following layer. The update is based on the metric \(R^\mathrm{new}_{lk}\), computed by CN \(l\) and stored for usage as \(R^\mathrm{old}_{lk}\) in the next iteration.

Convolutional turbo codes are typically specified as the parallel concatenation of two Convolutional Code (CC) encoders. The decoder is consequently made of two different Soft-In-Soft-Out (SISO) decoders, linked by an interleaver \(\Pi \) and a de-interleaver \(\Pi ^{-1}\). Each SISO decoder relies on the BCJR algorithm [3]. With each CC represented with a trellis, let \(k\) be a trellis step and \(u\) an uncoded symbol. Each decoder computes

$$\begin{aligned} \lambda _k[u] = \lambda ^\mathrm{apo}_k[u] - \lambda ^\mathrm{apr}_k[u] - \lambda _k[\mathbf {c}^u], \end{aligned}$$
(1)

where \(\lambda ^\mathrm{apo}_k[u]\) and \(\lambda ^\mathrm{apr}_k[u]\) are the a-posteriori and a priori information respectively, and the systematic component of the intrinsic information is represented by \(\lambda _k[\mathbf {c}^u]\).

In parallel decoders, the decoding process of the received frame is typically partitioned among \(P\) PEs. Messages are exchanged among PEs by means of an interconnection structure that is usually deterministic and guarantees fixed and uniform latency. To increase the degree of flexibility of the decoder, recent works have proposed NoCs as interconnection structures [10, 19, 27, 29]. Different techniques are available in the state of the art to partition the received frame and the decoding process among PEs: they all rely on the assignment of a set of variable nodes (in the LDPC case) and trellis steps (in the turbo case) to each processing element. Even though random partitioning has been proven to grant acceptable results, a graph coloring approach that minimizes the inter-PE communication has been employed in this case [7]. Figure 1 shows the basic structure of a NoC-based decoder. While in turbo decoders, the PEs are concurrent SISOs executing (1) on different sub-blocks, in LDPC decoders, each PE updates the LLRs involved in a certain set of parity check constraints. Consequently, the task array of each PE, i.e., the sequence of processes to be performed within an iteration, is a continuous set of trellis steps in the turbo case, and a uniform selection of parity checks from all H layers in the LDPC case. Each PE is connected to a Routing Element (RE), which in turn is linked to a number of other routers, with input buffers at every port. If every RE has an attached PE, the NoC has a direct topology (like the 2-D mesh and the generalized Kautz [24] shown in Fig.1), while in indirect NoCs (e.g., Benes [5]), some REs are only used as intermediate communication nodes. The NoC traffic is constituted of \(\lambda _k[u]\) and \(\lambda _k[c]\) values for turbo and LDPC codes, respectively, as shown in Fig. 1: messages are injected in the NoC directly by the PEs, while information received from the channel is stored in a memory for further use. The communication pattern with a NoC is deterministic, but the delays introduced are nonuniform and can vary consistently from code to code and with time. This nonuniform distribution of delays is basically due to the uneven distance among PEs. Choices like number of PEs, NoC topology, and routing algorithm have a significant impact on achievable throughput and implementation complexity, as extensively studied in [10, 27]. In general, a NoC used to support turbo and LDPC decoding is a particular type of NoC, characterized by very low complexity and reduced functionalities in comparison to usually considered NoCs. For example, since exchanged messages are usually six to ten bit values, NoC packets are single phits, sent using source routing. Moreover, the inter-PE communication patterns exhibit a pseudo-random nature, because of the limited adjacency of both interleavers used in turbo codes and parity check matrices associated to LDPC codes. Thus, the optimal mapping of processing tasks onto PEs does not provide any relevant benefit in terms of NoC traffic [26].

Fig. 1
figure 1

NoC-based parallel decoder structure

Additional techniques to optimize the NoC traffic are available. In particular, Quality-of-Service (QoS)-oriented networks can be considered for turbo and LDPC decoding. In [4], an area- and energy-wise breakdown of different router architectures is presented, providing a comprehensive overview of NoC designs. The CS network and the GuarVC network both target QoS improvement: the first one is a circuit-switched network that relies on simplicity and static allocation of resources, while the second makes use of multiple virtual channels and flow control: priorities are assigned to packets, and the traffic flow is optimized. However, CS networks are not effective for channel decoding, because the traffic pattern between PEs is not regular enough. Moreover, both circuit-switched and virtual channel-based NoCs tend to introduce large area and power consumption overheads. For example, a 22-PE NoC (the same choice made in Sect. 6) implemented as a CS network would be slightly larger than the whole flexible decoder in [16], and its power consumption would be higher by a 2.4 factor. The same NoC, implemented as a GuardVC network, leads to 1.3 times larger area and 2.9 times higher power consumption. Therefore, usual techniques to reduce traffic in NoCs are not effective in the case of NoC-based decoders, and dedicated solutions are required.

The complex interactions between the topology of the NoC, the number of PEs, the code, and the performance of the decoder have been analyzed with a dedicated bit-true and cycle-accurate tool (JANoCS) [9]. This cycle-accurate tool allows to simulate the processing and communication phases of the decoding process concurrently, observing the state of the network and the traffic-related issues together with Bit Error Rate (BER) performance. From extensive simulations it has been possible to observe how the on-time delivery of messages is of fundamental importance for the decoding process: to analyze this concept, let us define the Normalized Delivery Time (NDT) as

$$\begin{aligned} \mathrm{NDT}=\frac{t_\mathrm{d}-t_0}{t_\mathrm{u}-t_0}, \end{aligned}$$
(2)

where \(t_0\) is the sending time of a message, \(t_\mathrm{d}\) the arrival time at its destination PE, and \(t_\mathrm{u}\) the instant when the considered message has to be used at the destination PE. Given a certain decoder architecture, \(t_\mathrm{u}\) depends on the target throughput while \(t_\mathrm{d}\) is related to the NoC clock frequency. Messages that are delivered on time, i.e., that reach their destination before \(t_\mathrm{u}\), are characterized by NDT \(\le 1\). Late messages have \(t_\mathrm{d}>t_\mathrm{u}\), leading to NDT \(>1\). The condition NDT \(<1\) tends to be very restrictive for codes used in current standards, because of the already mentioned low degree of adjacency in typical inter-PE communication patterns [36]. As a consequence, also a smart mapping of tasks on PEs does not allow for any useful clustering, and \(t_\mathrm{d}\) values have a strongly nonuniform distribution.

In order to guarantee NDT \(<1\) for the totality of messages, either the throughput must be reduced (which means reducing the PEs clock frequency), or the NoC clock frequency should be increased (without altering the PEs clock frequency) [10].

An ideal situation for a decoder is shown in Fig. 2, that plots the number of messages with respect to the NDT for a WiMAX code of block size 2304 and rate 1/2, decoded with a 16-PE decoder mapped on a Kautz network: in this decoder, to have NDT \(<1\) for all messages, the NoC frequency is 420 MHz, while the PEs only run at 280 MHz.

Fig. 2
figure 2

Message normalized delivery time distribution—all messages reach the destination on time

Figure 3 gives the message delivery time distribution for the same code, with the difference that both PE and NoC frequency are 280 MHz. Here, a consistent percentage of messages has NDT \(>1\). Late messages are extremely disruptive for the performance of the decoder. In case a message has not been delivered at \(t_\mathrm{u}\); the destination PE can use the value computed at the previous iteration, but this leads to additional errors. Figure 4 plots the BER curves of WiMAX turbo code of size 960, rate 1/3 and WiMAX LDPC code of size 2304, rate 1/2 under different percentages of late messages for illustration. These are obtained by changing the frequency of the 16-PE Kautz NoC used in Figs. 2 and 3. Simulations have been performed on an AWGN channel, with BPSK modulation and fixed point precision (10 bits, 3 of fractional part). It can be seen that very small amounts of late messages (\({<}1~\%\)) degrade the decoder performance, while larger percentages completely compromise the decoder error correction capabilities. It is clear that late messages cannot be simply discarded; however, efficient methods to reduce and optimize the traffic are introduced in the following Section.

Fig. 3
figure 3

Message normalized delivery time distribution—some messages do not reach the destination on time

Fig. 4
figure 4

BER degradation due to late information update

3 Traffic Reduction and Optimization

Reducing the number of messages traveling on the NoC is bound to speed up the delivery of those remaining, with shorter queues and fewer collisions. Two techniques have been devised and tested toward these goals: they are based on the general concept that not all information messages (which are of a probabilistic nature) traveling on the network are essential for the success of the decoding. These two techniques (Sects. 3.1 and 3.2) can be used either alone or combined. Two additional methods to optimize the NoC traffic are described in Sect. 3.3.

3.1 Hard Importance

The Hard Importance (HI) method allows to refrain from sending messages that are estimated to be of low impact on the decoding process. A similar approach was considered in [7, 30]. In the LDPC decoding case, the messages traveling on the NoC are different updates of \(\lambda _k[c]\). The HI checks are performed once per iteration to each of them. Consider the following comparisons:

$$\begin{aligned} sgn(\lambda ^n_k[c])&= sgn(\lambda ^{n-1}_k[c])\end{aligned}$$
(3)
$$\begin{aligned} |\lambda ^n_k[c]|&\ge |\lambda ^{n-1}_k[c]|\end{aligned}$$
(4)
$$\begin{aligned} |\lambda ^n_k[c]|&\ge \mathtt{{Thr}}^{HI}\cdot \max (\lambda _k[c]), \end{aligned}$$
(5)

where \(n\) expresses the nth iteration, and \(\max (\lambda _k[c])\) is the maximum possible value of \(\lambda _k[c]\) given the number of bits assigned to its representation, while \(0\le \) \(\mathtt{Thr^{HI}}\le 1\) expresses the percentage of \(\max (\lambda _k[c])\) involved in the comparison. If all above three conditions are verified, \(\lambda _k[c]\) is flagged as unimportant, and the bit LLR is not updated anymore for the rest of the decoding process. The first two comparisons check the presence of a monotonic divergence from zero in the LLR value, while the third requires a large enough absolute value to be satisfied. Compliance with all three checks confirms that the information is already reliable, and that a change of sign (and consequently a bit flip) is extremely unlikely. Since with layered scheduling a \(\lambda ^k[c]\) is updated many time within each iteration, a single unimportant LLR will result in a traffic reduction of several messages.

For turbo decoding, the choice of stopping or not a message can be made by modifying the Symbol Reliability Difference (SRD) criterion proposed in [30]. Defining

$$\begin{aligned} \delta ^{(i)}_m(d_k)=L^{(i)}_m(d_k=d^1_k) - L^{(i)}_m(d_k=d^2_k) \end{aligned}$$
(6)

as the difference between the logarithmic extrinsic probabilities of the first and second most probable symbols, the original SRD criterion proposes, for each symbol \(d_k\)

$$\begin{aligned} \phi ^{(i)}_{m,m'}(d_k)=|\delta ^{(i)}_m(d_k)-\delta ^{(i)}_{m'}(d_k)|, \end{aligned}$$
(7)

where \(m'\) refers to the metrics at the input of the SISO, coming from the previous half-iteration, and \(m\) at the output. If condition

$$\begin{aligned} \phi ^{(i)}_{m,m'}(d_k)\le \mathtt{{Thr}}^{HI}_{Abs} \cdot \max (\phi ^{(i)}_{m,m'}(d_k)) \end{aligned}$$
(8)

is satisfied, the message can be stopped. Applying this method as is, however, led to unsettling results. This is due to the fact that it is not taken in account that (7) could give very low \(\phi ^{(i)}_{m,m'}(d_k)\) also if both \(\delta ^{(i)}_m(d_k)\) and \(\delta ^{(i)}_{m'}(d_k)\) are very close to 0. In this case \(\phi ^{(i)}_{m,m'}(d_k)\) expresses just uncertainty about symbols, instead of agreement between SISOs, and the message should not be stopped. For this reason, an additional control has been added for the message to be stopped

$$\begin{aligned} |\delta ^{(i)}_m(d_k)|,|\delta ^{(i)}_{m'}(d_k)| \ge \mathtt{{Thr}}^\mathrm{HI}_\mathrm{Diff} \cdot \max (\delta ^{(i)}(d_k)), \end{aligned}$$
(9)

where \(\mathtt{Thr}^\mathrm{HI}_\mathrm{Diff}\) assures a degree of reliability on the symbol.

The performance of HI is of large interest, since this method can be applied also to other types of decoders beside NoC-based ones. HI acts on messages by deciding if it is worth updating a value or not, and can be effective also in the absence of a NoC. It can consequently be exploited in decoders that rely on shared memory banks: in this case, energy is saved by reducing the number of memory write operations.

3.2 Soft Importance

The Soft Importance (SI) technique evaluates the state and the evolution of the information exchanged through the NoC, flagging non-essential messages as expendable. In case of collisions, i.e., messages that need to be routed through the same output port, the router arbiter will forward a message and discard all the expendable messages that were not granted priority. In LDPC decoding, (3)–(5) are applied in SI with a more relaxed \(\mathtt{Thr}^\mathrm{SI}\), while the metrics considered are not \(\lambda ^n_k[c]\) and \(\lambda ^{n-1}_k[c]\), but \(\lambda ^\mathrm{new}_k[c]\) and \(\lambda ^\mathrm{old}_k[c]\). Each PE monitors the evolution of the LLR locally, before and after the processing: if all three comparisons are verified, the message is expendable. Since in turbo decoding the HI method already affects each message separately, the SI method can be efficiently implemented with exactly the same mechanism as HI. The expendable flag is applied to messages satisfying the modified SRD conditions with more relaxed thresholds \(\mathtt{Thr}^\mathrm{SI}_\mathrm{Abs}\) and \(\mathtt{Thr}^\mathrm{SI}_\mathrm{Diff}\).

3.3 Urgency and Buffer Reordering

Smarter and more efficient communication on the NoC can be obtained through the identification of urgent and less urgent messages: traffic optimization deals with the late message issue by prioritizing the former against the latter. Priority-based routing is a well-explored path to guarantee QoS: multiple virtual channels are often assigned different priorities to differentiate traffic flows [12, 26]. The concept of priority is applied here in an original way, by a single channel with a reordering buffer.

The Urgency technique (U) implements a priority-based collision management policy. In case of collision, priority is given to the most urgent message, i.e., the message which is needed by its destination PE sooner. To allow this kind of decision, an urgency field must be added to the message during sending, initialized with an estimate of the number of clock cycles available before the message is needed by another PE. The field must be updated by the routers, taking in account the wait cycles spent in input buffers, and a message is discarded if its urgency reaches zero, avoiding unnecessary switching activity for late messages. With the LDPC case, each PE can perform an estimation based on local knowledge of the instant in which the outgoing message is going to be needed. The precision of the estimate strongly depends on the regularity of the partitioning of the H matrix among the PEs. On the contrary, in the turbo case, since the interleaving rule is known to all PEs, the measure can be exact.

In Buffer Reordering (BR), a fast lane can be created by arranging the messages in the input buffers not in arrival order, but according to the urgency field. The most urgent message in a buffer will consequently always be the first one to be pulled out, increasing its chances of arriving on time.

4 Performance Results

The impact of each of the proposed techniques, alone and in combination with one another, has been evaluated with the JANoCS tool [9]. Extensive simulations have considered a wide range of codes, NoC topologies, number of PEs, routing algorithms, PE, and RE architectures. Table 1 lists the percentages of late and stopped messages for a set of LDPC and turbo codes taken from WiMAX [22] and 3GPP-LTE [31] standards, considering a decoder implementing different combinations of the proposed techniques. The codes have been mapped on 16-PE and 32-PE NoCs with generalized Kautz [24] and two-dimensional mesh topologies. Meshes are a common topology for middle-sized NoCs, and the simple X–Y routing algorithm [13] joins good performance with ease of implementation. Kautz networks have been proven effective for turbo and LDPC decoding in [27] and [8], respectively, also in the presence of REs of degree three: routers implement a shortest path routing algorithm [27]. With all the considered NoCs, each PE produces one \(\lambda _k[c]\) message per clock cycle in the LDPC case, and one \(\lambda _k[u]\) message in the turbo case. While with single-binary turbo codes, \(\lambda _k[u]\) consists of a single value, three values are necessary in double-binary turbo codes: the width of the simulated channel is adapted accordingly.

Table 1 Percentages of late messages (% late) and stopped or not sent messages (% stop) for no traffic handling (No TH) and combinations of the proposed methods on 16-PE and 32-PE generalized Kautz and 2D-Mesh NoCs, at BER \(=10^{-5}\)

HI and SI show high percentages of stopped messages (up to 29 %), with substantial reduction in late messages (down to 1.6 %), especially for HI and HI+SI. Results for HI and SI were obtained at the SNR point for which BER \(=10^{-5}\) with 10 maximum iterations for LDPC codes and 8 maximum iterations for turbo codes. Iterations are stopped as soon as the codeword is correct, and the number of stopped messages is averaged over all performed iterations. When this kind of early stopping criterion is not present, HI can work as a valid substitute. In fact, if a codeword is correct and its decoding continues, the magnitude of all LLRs keeps growing and HI effectively prevents all messages from being sent. HI and SI inherently introduce some BER degradation, for which careful threshold calibration is necessary: if set too low, the stopped messages can still carry information about uncertain bits, introducing new errors. A wide range of possible threshold values have been considered and simulated, with the final choice representing a good tradeoff between method effectiveness (i.e., number of stopped messages) and BER degradation. The decoder can incur in additional errors, in case, a metric update is stopped too early by HI or SI: however, threshold calibration allows for results similar to those shown in Fig. 7, where the impact of HI on BER for an LDPC code and a turbo code are presented. The percentages of stopped messages assumed in these simulations are consistent (17 and 18 %, respectively), but both the LDPC and the turbo code show negligible performance losses. In the plots of Fig. 5, the thresholds have been set as \(\mathtt{Thr}^\mathrm{HI}=0.2\) and \(\mathtt{Thr}^\mathrm{SI}=0.1\), while in Fig. 6 as \(\mathtt{Thr}^\mathrm{HI}_\mathrm{Abs}=0.25\), \(\mathtt{Thr}^\mathrm{HI}_\mathrm{Diff}=0.15\), \(\mathtt{Thr}^\mathrm{SI}_\mathrm{Abs}=0.15\), and \(\mathtt{Thr}^\mathrm{SI}_\mathrm{Diff}=0.05\). These thresholds are also used to derive the “Ideal THR” curves in Figs. 8 and 9, that show how variations in the threshold values affect the BER performance.

Fig. 5
figure 5

Impact of the proposed techniques and their combinations—LDPC codes

Fig. 6
figure 6

Impact of the proposed techniques and their combinations—turbo codes

Fig. 7
figure 7

Impact of HI on BER

Fig. 8
figure 8

Impact of different threshold choices on HI+SI+U+BR - LDPC codes

Fig. 9
figure 9

Impact of different threshold choices on HI+SI+U+BR - turbo codes

The results given by the urgency U method alone are not satisfying, since its effects are mostly appreciated in case of collisions, which can involve also non-critical messages, its effectiveness alone is limited. However, as soon as a message is identified as late it is discarded: this means that the percentage of stopped messages for which U contributes is equal to the percentage of late messages. The U+BR urgency-based buffer reordering, which allows non-urgent messages to be delayed in favor of critical ones, drastically reduces the occurrence of late messages (from 11.1 to 0.2 %). Its effectiveness can be improved further by combining the two traffic reduction methods with the traffic optimization ones. The joint application of all four techniques (HI+SI+U+BR) guarantees a late message percentage close to zero in most cases, while substantially reducing their impact in the remaining cases.

From Table 1, it is possible to make some important observations. As expected from previous analysis [8, 27], the Kautz topology performs better than 2D-mesh when targeting LDPC and turbo codes. Moreover, turbo codes suffer less from late messages w.r.t. LDPC codes (No TH column), thanks to their less critical communication phase. It can also be noticed how the size and the rate of the code affect the number of late messages, especially when related to the size and topology of the NoC. A small LDPC code has small H layers and a small turbo code has short half-iterations: consequently, the available message delivery times are limited. Moreover, small codes mapped on a large NoC suffer from a large number of late arrivals, since the distance between PEs dominates the transmission times. This is the case of the WiMAX LDPC 576, rate 5/6 in Table 1: similar effects are encountered with larger codes when mapped on the 32-PE NoCs. On the contrary, queues and collisions are the main sources of delay in case of large codes mapped on small NoCs. These limit cases (e.g. LDPC 576, rate 5/6 in Table 1) cannot be completely solved with the implementation of the proposed traffic handling techniques, and need to work in conjunction with alternative techniques: for example, the code can be mapped on a smaller portion of the NoC, and the unused part of the decoder can be deactivated to save energy.

The percentage of late messages is directly reflected on the decoder performance, as shown in Figs. 5 and 6: a high percentage of late messages results in an unreliable decoding process. The effects of the different combinations considered in Table 1 on the BER of an LDPC and turbo code are plotted, using the same NoC and PE architectural choices. As an example, Fig. 5 shows the BER results for a WiMAX LDPC code of rate 1/2 and block size 1440, with 10 maximum iterations; the duo-binary WiMAX turbo code in Fig. 6 has an information block size of 2400 symbols and rate 1/3, decoded with 8 iterations. However, the behaviors observed in Figs. 5 and 6 do not depend on the choice of the codes, but only on the percentage of late messages. In both plots, the “ideal network” curve represents an ideal decoding process, where the interconnection structure does not introduce any latency (lower bound), while the “no traffic handling” curve shows the BER in case of a real decoding process in which no one of the methods is applied (upper bound). An actual decoder would obtain this BER curve if it did not wait for the delivery of any of the late messages that would arise given a certain NoC topology and partitioning of tasks. It can be noticed how all the performance curves of the proposed solutions span the interval between the upper and lower bound. HI, SI, and HI+SI curves do not provide substantial performance improvement, though giving interesting results in terms of traffic reduction, and consequently reducing the switching activity. The U curve is still very close to the “no traffic handling” one, reflecting its limited effectiveness shown in Table 1. The U+BR curve shows the performance in the presence of the powerful buffer reordering, with a further step towards the reference curve made by the HI+SI+U+BR curve, that combines all four methods. Though the proposed techniques behave coherently in both cases, they yield slightly better results in the turbo case, as expected from Table 1.

Figures 8 and 9 show the impact of different threshold values on the BER performance of HI+SI+U+BR applied under the same conditions and to the same codes of Figs. 5 and 6. As mentioned earlier in this section, the “Ideal THR” curves have been obtained by simulating an extensive set of possible threshold values. The final choice has been made by selecting the threshold values that maximize the number of stopped messages without degrading the BER performance, and the obtained percentages of stopped messages are those reported in Table 1. The choice of the threshold is not very critical, as shown by curves labeled as “Similar THR” in Figs. 8 and 9, which have been derived by rising each “Ideal THR” threshold by 10 %: it can be seen how the curves are almost superimposed, and similar minor fluctuations are observed in the number of stopped messages as well. Larger threshold variations have much more influence than the BER: the thresholds used in the “High THR” curves are obtained by tripling the “Ideal THR” thresholds. Very high threshold values result in a very small percentage of stopped messages, and in the ineffectiveness of both HI and SI. In fact, “High THR” BER curves are very similar to the U+BR curves of Figs. 5 and 6, where HI and SI are not applied. On the contrary, very low thresholds as the ones used in “Low THR” curves (half of “Ideal THR” thresholds) dramatically increase the number of stopped messages. However, a large number of messages carrying useful information are stopped as well, causing consistent BER degradation.

5 Hardware Architecture

The similarity of the calculations involved in HI and SI, and the necessity for controls at each RE for SI, U, and BR allows for efficient resource sharing. The multi-mode decoder described in [10] has been taken as reference architecture: among the different implementations, the one denoted A in the paper has been selected and called A \(_\mathrm{ref}\). It relies on a 22-PE Generalized Kautz NoC, and complies with WiMAX, HPAV [18], and DVB-RCS [14] standards, with limited support for WiFi [23].

The HI method can be easily implemented in the SISO. At the beginning of each trellis step, one or more read operations from the data memory are required, and they provide the data needed to calculate the \(\delta ^{(i)}_{m'}(d_k)\) member of Eq. (7). At the end of the trellis steps, also the data needed to calculate \(\delta ^{(i)}_m(d_k)\) are ready, thus making it possible to compute Eqs. (7), (8) and (9).

A memory bit is required for each trellis step to signal if the outgoing messages are unimportant and must not be sent. For LDPC codes, since the considered metrics are \(\lambda ^n_k[c]\) and \(\lambda ^{n-1}_k[c]\), implementation of HI is less straightforward. The main issue is the fact that the storage of \(\lambda ^n_k[c]\) is performed by replacing \(\lambda ^{n-1}_k[c]\). Eq. (3) to (5) can, however, be executed without any additional memory for the storage of \(\lambda ^{n-1}_k[c]\) by configuring the data memory as Read-Before-Write. When a message is flagged as unimportant, the corresponding memory bit is set, and the unimportant flag must be propagated to all other PEs. A dedicated STOPPING message is sent in place of the unimportant message. Figure 10 shows the circuit responsible for the HI check and eventual creation of the STOPPING message: the PROCESSING block executes the computations necessary to update each \(\lambda _k[c]\). When a STOPPING message is received, the unimportant memory bit is set in the destination PE, and the STOPPING message is propagated. If a PE receives a STOPPING message when the unimportant memory bit has already been set, the STOPPING message is not sent anymore. The STOPPING message is mapped to the lowest negative value that can be represented with the allocated number of bits: for example, with 9 bits, the data dynamic range is mapped to the interval (\(-\)255, +255), and the value \(-\)256 is recognized as a STOPPING message.

The HI method can also be applied to save energy by simply reducing the number of memory write operations at the destination PEs. When a given message is received by its destination PE, the writing into the internal memory can be avoided if the message is recognized as unimportant. The implementation of this functionality still exploits the STOPPING message, which is used to control the write enable signal of the memory and prevent write operation. However, in this case, the STOPPING message must be sent to the memory at every iteration.

Fig. 10
figure 10

HI implementation for LDPC codes - STOPPING message

The implementation of SI follows that of HI in the turbo case, only requiring two additional comparators for the different thresholds in (8) and (9). It is instead much simpler than HI for LDPC codes, since both \(\lambda ^\mathrm{new}_k[c]\) and \(\lambda ^\mathrm{old}_k[c]\) are available during each parity check computation, and there is no need for flag propagation. Together with the simple computational logic in the PEs, a flag bit signaling if a message is expendable or not must be added also in all NoC FIFOs and channels, while changes in the write and read pointers (WPTR and RPTR) of each FIFO are forced by the RE arbiter in case of collisions.

The U method requires, for the initialization of the URGENCY field of outgoing messages, the estimation of the available delivery time. This measure can be obtained, in the turbo case, thanks to the current trellis step together with the globally known interleaving rule. Since each SISO processes a sequential set of trellis steps, the destination memory addresses is a precise identifier of the time instant a message will be needed. The URGENCY field of each outgoing message can be initialized as

$$\begin{aligned} U_\mathrm{turbo}=T_\mathrm{half}-t_\mathrm{send}+t_\mathrm{need}, \end{aligned}$$
(10)

where \(T_\mathrm{half}\) is the duration of a half iteration, \(t_\mathrm{send}\) is the time stamp of the sending instant, and \(t_\mathrm{need}\) equals to the destination memory address multiplied by the number of cycles needed to complete each trellis step. The destination address is also used in the initialization of U in the LDPC case. By multiplying it by the minimum row degree of H, a lower bound of \(t_\mathrm{need}\) is obtained, thus leading to the following equation

$$\begin{aligned} U_\mathrm{LDPC}=t_\mathrm{need}-t_\mathrm{send} \end{aligned}$$
(11)

The urgency field requires additional bits in all the NoC FIFOs and channels, together with the simple initialization logic at each PE. Moreover, each FIFO of length \(F\) needs \(F\) adders to update U at each clock cycle, while WPTR and RPTR must be updated in case the urgency field reaches zero, and the corresponding message discarded. All the FIFO memory elements must be available for writing at each clock cycle: the FIFO consequently must be implemented with registers, and not with a RAM. Finally, the priority of the RE arbiter is changed from being FIFO length-based [10] to urgency-based.

The implementation of BR requires all the modifications described for U, plus a novel method for the update of WPTR and RPTR. The RE input buffers in fact lose their FIFO nature, since the input order is not guaranteed to be the output order. Figure 11 shows the simplified structure of the proposed buffer reordering mechanism; white blocks represent registers, while gray blocks indicate additional computation elements. Along with the URGENCY field, each buffer element requires an additional VALID field. Read and write operations on the buffer take in account external signals (PUSH, POP) and the internal state of the buffer (IS_EMPTY, IS_FULL). Every time a write operation is performed, the VALID field of the corresponding buffer element is set, while it is cleared with a read operation. The VALID fields are necessary to keep track of the free and occupied elements, since the irregular input and output orders prevent WPTR and RPTR from being used for this purpose. During write operation, the urgency field from the incoming message \(\mathtt{URGENCY}_\mathrm{IN}\) is substituted according to WPTR to one of the stored \(\mathtt{URGENCY}_X\) during the U update process. Thus, the updated value of \(\mathtt{URGENCY}_X\) is \(\mathtt{URGENCY}_\mathrm{IN}-1\) instead of \(\mathtt{URGENCY}_X-1\). In a concurrent read operation, a \(\mathtt{URGENCY}_X\) value is selected as the buffer output \(\mathtt{URGENCY}_\mathrm{OUT}\) according to RPTR. The SEL module chooses the update value of WPTR among the elements with cleared VALID field with fixed priority. The MIN module, instead, updates RPTR as the pointer to the minimum \(\mathtt{URGENCY}_X\) among the VALID ones. The whole operation occurs within a single clock cycle, and correct functionality of the circuit has been tested in 90 nm CMOS technology for up to 10 buffer elements, with 200 MHz as target throughput. The area overhead introduced by the additional operations in the reordering buffer has been evaluated with respect to a typical FIFO buffer. For example, a reordering buffer with five buffer elements accounts for a little more than 2000 \(\upmu \mathrm{m}^2\), against the 850 \(\upmu \mathrm{m}^2\) of a regular FIFO, with a \(\times 2.35\) area increment factor.

Fig. 11
figure 11

Urgency-based buffer reordering

6 Implementation

To help a fair comparison with the state of the art, all the modifications have been applied to the A \(_\mathrm{ref}\) decoder, creating a new implementation A \(_\mathrm{new}\), targeting 90 nm CMOS technology: starting from VHDL models designed after the exploration performed with JANoCS, synthesis has been carried out with Synopsys Design Compiler, functional simulation and validation with Mentor Graphics ModelSIM, and place and route with CADence SoC Encounter. Table 2 dissects the post place and route area occupation of various components of the decoder before and after the implementation of the proposed methods. The small increase in memory occupation is due to the extra memory bit for the unimportant flag related to HI, shared between SISOs and LDPC PEs. The simple logic required for both HI and SI in the turbo case results in an additional 7.1 % area occupation for SISOs, while the more complex operations involved in the LDPC PE for HI lead to a higher area overhead. The widths of NoC buffers and channels have been increased to accommodate the URGENCY field (five bits), the VALID bit of BR and the expendable bit of SI. This, together with the additional logic for U and BR, heavily affects the NoC area, with an overhead exceeding 30 %. With a total area of 3.11 mm\(^2\), the modified decoder is 13.1 % larger than A \(_\mathrm{ref}\). However, as mentioned in Sect. 2, A \(_\mathrm{ref}\) deals with the late message issue with a NoC clock frequency higher than the PE clock frequency: with the introduction of the traffic reduction and optimization techniques, however, this is not necessary anymore, and the decoder can be clocked with a single frequency. Table 3 details the worst case power consumption \(Pow\) for A \(_\mathrm{ref}\) and A \(_\mathrm{new}\) architectures: in the original decoder, the clock frequency \(f_\mathrm{clk}\) is set to 200 MHz for the PEs and to 333 MHz for the NoC. The global power consumption is 116.6 mW, with the NoC accounting for 41.7 % of the total. Total power consumption in A \(_\mathrm{new}\) is reduced of 15 % w. r. t. A \(_\mathrm{ref}\), while the power gain on the NoC alone reaches a very consistent reduction of 40.2 %. This result basically derives from the lower clock frequency of the NoC with respect to architecture A \(_{ref}\), but the clock frequency reduction is made possible thanks to the adoption of the described traffic reduction techniques. A final measure of the ratio between costs and advantages in reducing the NoC power consumption can be obtained as

$$\begin{aligned} \left( Pow^\mathrm{NoC}_{\mathrm{\mathbf A}_\mathrm{new}} + Pow^\mathrm{PE}_{\Delta }-Pow^\mathrm{NoC}_{\mathrm{\mathbf A}_\mathrm{ref}}\right) /Pow^\mathrm{NoC}_{\mathrm{\mathbf A}_\mathrm{ref}} = -35.8~\%, \end{aligned}$$
(12)

where \( Pow^\mathrm{PE}_{\Delta }\) is the power consumption increment in PEs due to the contribution of HI, SI, and U initialization. The implementation of HI+SI+U+BR, taking in account all the introduced overheads, brings a power reduction on the NoC equal to 35.8 % of \(Pow^\mathrm{NoC}_{\mathrm{\mathbf A}_\mathrm{ref}}\).

Table 2 Effect of traffic handling on area occupation (CMOS 90 nm technology, post-layout results)
Table 3 Effect of traffic handling on power consumption (CMOS 90 nm technology, 1.0 V supply)

The actual impact of HI on the energy consumption of centralized decoders can also be estimated. In [2], the energy breakdown of a decoder based on memory banks sharing is given. The energy consumption of the \(\gamma \)-memory accounts for 70 % of total dynamic energy for WiMAX LDPC code of size 576 and rate 5/6. For this particular code if HI is applied the percentage of stopped messages, and consequently of avoided write operations, is around 17 % (Table 1). Since write operations contribute for approximately 50 % of the energy expenditure, the implementation of HI leads to a 6 % reduction in the total decoder energy consumption.

A simple direct way to reduce the traffic on the NoC is to reduce the number of iterations of the channel decoder. Table 4 compares the impact of such method with respect to the proposed techniques in terms of energy efficiency and BER degradation. In this table, we consider the BER performance and energy consumption of A \(_\mathrm{red}\), i.e., A \(_\mathrm{ref}\) with a reduced number of maximum iterations, and compares it with A \(_\mathrm{new}\) and the original A \(_\mathrm{ref}\), for two code examples. The number of performed iterations is represented by \(n^\mathrm{(max)}_{it}\), E\(_\mathrm{frame}\) expresses the energy spent per decoded frame and \(\Delta \)SNR shows the performance degradation with respect to A \(_\mathrm{ref}\) when BER=\(10^{-5}\). In A \(_\mathrm{red}\), the reduced number of iterations (by only one iteration) leads to a proportional reduction of E\(_\mathrm{frame}\), but non-negligible performance degradation is present. It is especially noticeable in the turbo case, that relies on a smaller \(n^\mathrm{(max)}_{it}\). The proposed implementation outperforms the iteration reduction in terms of energy savings, while at the same time affecting the BER performance only marginally.

Table 4 Performance and energy consumption comparison (CMOS 90 nm technology, 1.0 V supply)

Finally, Table 5 compares the results of A \(_\mathrm{new}\) with A \(_\mathrm{ref}\) and few recent related state-of-the-art LDPC and turbo decoders. The energy efficiency has been introduced to help a fair comparison, and is defined as \(E_\mathrm{eff}=\mathrm{Pow}/(T\cdot n^\mathrm{(max)}_{it})\), where \(T\) is the achieved throughput and \(n^\mathrm{(max)}_{it}\) is the maximum allowed number of iterations, while the power consumption has been normalized to the CMOS 90 nm process. This measure expresses the energy spent for each decoded bit. The area efficiency, relates the area occupation normalized to the CMOS 90 nm process (An\(_\mathrm{tot}\)) to \(T\) and the clock frequency \(f_\mathrm{clk}\), and is defined as A\(_\mathrm{eff}=(T\cdot n^\mathrm{(max)}_{it}/f_\mathrm{clk})\cdot (1000/\)An\(_\mathrm{tot})\), where the 1000 multiplication factor changes the unit of measure from \(\Big (\frac{ \mathrm{bits} }{ \mathrm{mm}^2\cdot \mathrm{cycles}}\Big )\) to \(\Big (\frac{{ \mathrm bits} }{ \mathrm{mm}^2\cdot \mathrm{kcycles} }\Big )\). The +13.1 % in A\(_\mathrm{tot}\) that A \(_\mathrm{new}\) exhibits w. r. t. A \(_\mathrm{ref}\) leads to a lower A\(_\mathrm{eff}\). The effects of the proposed methods, though, can be really appreciated by observing \(Pow\) and \(E_\mathrm{eff}\). The reduction of the NoC clock frequency to 200 MHz overcompensates the increased peak power consumption caused by the additional logic, leading to improved \(E_\mathrm{eff}\) values.

Table 5 LDPC/Turbo architectures comparison: CMOS technology process (Tp), total area occupation (A\(_\mathrm{tot}\), normalized area occupation for 90nm technology (An\(_\mathrm{tot}\)), clock frequency (\(f_\mathrm{clk}\)), peak power consumption (Pow), energy efficiency (\(E_\mathrm{eff}\)), maximum number of iterations (\(n^\mathrm{(max)}_{it}\)), code length (\(N\)) and rate (\(r\)), interleaver size (\(K\)) and throughput (\(T\)), Area efficiency (A\(_\mathrm{eff}\))

The multi-standard LDPC and turbo presented in [1] have a very small area occupation, with uneven throughput between LDPC and turbo mode. This situation leads to very high maximum A\(_\mathrm{eff}\) and E\(_\mathrm{eff}\) in LDPC mode. A \(_\mathrm{new}\), instead, is far more efficient in turbo mode, and yields better results also in [1] LDPC mode worst case operating conditions (\(N\) = 672, \(r=1/2\), 20 iterations). The flexible LDPC/turbo decoder designed in [16] and A \(_\mathrm{new}\) achieve comparable throughputs when decoding LDPC codes, with [16] having a larger An\(_\mathrm{tot}\). This leads to A \(_\mathrm{new}\) having slightly better A\(_\mathrm{eff}\) and E\(_\mathrm{eff}\): the gap is much larger in turbo mode.

The high parallelism, single-standard WiMAX LDPC decoder presented in [11] guarantees very high throughput with a 40 MHz frequency, that allows for reduced power consumption and great efficiencies. Though [11] has been designed for ultra-low power consumption, and A \(_\mathrm{new}\) targets multiple code types and standards, the normalized power gap between them is minimal. This work fares even better when compared to the dedicated Application-Specific Integrated Circuit (ASIC) targeting 3GPP-LTE turbo codes in [34] that yields good A\(_\mathrm{eff}\) and throughput. Since the supply voltage in [34] is different from that of A \(_\mathrm{new}\), an additional correction factor of \((1.1/1.2)^2\) has been applied to the dynamic power consumption that can be assumed to contribute to 30 % of the total power consumption in 130 nm CMOS technology. The estimated normalized Pow of 249 mW with the 90 nm node is still higher than A \(_\mathrm{new}\), with lower efficiency.

7 Conclusion

The paper proposes novel power reduction techniques for NoC-based channel decoders: extensive traffic and performance analysis are performed, while the hardware implementation allows to assess the impact on a relevant design example and to obtain accurate power consumption. By dealing with the late message delivery problem through traffic reduction and optimization, the proposed techniques allow to avoid power expensive solutions, while at the same time guaranteeing a reliable decoding process. Moreover, one of the methods can be applied to any channel decoder architecture. Simulation results show the performance of different combinations of the presented techniques: all of them are effective, in particular HI+SI+U+BR, that allows to reach the target throughput with minor BER degradation also in the presence of a late message percentage of more than 20 % of the total. The implementation of this method on a known decoder allows for a 15 % total power reduction (40.2 % NoC power reduction): it is presented and compared to the state of the art. Post place and route results show small area occupation and very low power consumption, with good efficiency measures w.r.t. even much less flexible decoder.