

# **High performance area efficient DA based FIR filter for concurrent decision feedback equalizer**

**K. Vijetha1 · B. Rajendra Naik1**

Received: 2 December 2019 / Accepted: 24 February 2020 / Published online: 3 March 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020

#### **Abstract**

In this paper we propose a novel distributed arithmetic (DA) based block FIR filter for the design of decision feedback equalizers (DFEs). A block FIR filter is designed using the DA architecture and is used to implement the DFE. Introducing the block FIR filter architecture increases the throughput rate of the design. The proposed DA architecture is implemented as an application specific integrated circuit (ASIC) with the Synopsys Design Compiler tool using SAED 90 nm technology. The DFE application is implemented in Matlab Simulink and the Xilinx System Generator tool. The results show 71% less area delay product (ADP) and 65% less energy delay product (EDP) than the existing architecture. The proposed DA based DFE architecture removes ISI and is well suited for digital communication systems.

**Keywords** FIR filter · Distributed arithmetic · Feed-back filter · Feed-forward filter equalizer · Signal transmission

# **1 Introduction**

One of the major obstacles in digital communication systems (Barry et al. [2012](#page-5-0); Proakis and Manolakis [1996](#page-6-0)) is channel distortion caused by intersymbol interference (ISI). High speed DSP systems are needed as the data rates of transmission systems increase. The speed of DSP systems is limited to about 1 GHz, so new architectures are needed to operate at multi-gigabit data rates. Pipelining and parallel processing are two schemes that have been employed successfully to increase the computation speed of digital communication systems (Parhi [2004](#page-6-1)). Distortions such as ISI can be reduced by introducing equalizers (Lin et al. [2012](#page-6-2), [2006](#page-6-3); Oh and Parhi [2006](#page-6-4)) at the receiving side of the transmission system. Equalizers fall into two classes, linear and non-linear. Figure [1](#page-1-0) shows the classification of equalizers. Non-linear equalizers perform better than linear equalizers.

 $\boxtimes$  K. Vijetha vijethakura@gmail.com

B. Rajendra Naik rajendranaikb@gmail.com

The decision feedback equalizer (DFE) and the maximum likelihood sequence detector (MLSD) are two prominent non-linear techniques used to combat ISI errors and improve the signal-to-noise ratio (SNR) by removing the post-cursor ISI at the output. The DFE also eliminates the noise amplification problems caused by spectral nulls. Despite these advantages, it poses a severe speed problem: the speed of the equalizer is limited by the iteration bound of the feedback filter (FBF). Several techniques have been proposed in the past to enhance the speed and reduce the hardware. Lin et al. ([2006](#page-6-3)) and Lin et al. ([2007](#page-6-5)) designed the DFE architecture by removing the feedback loop and introducing multiplexers and a comparator to enhance the speed and decrease the chip area. This work was later extended (Parhi [2005](#page-6-6)) with parallelization and unfolding techniques to reach gigabit operation. Look-ahead pipelined multiplexer-loop based DFEs are described in (Lin et al. [2006](#page-6-3); Parhi and Messerschmitt [1989](#page-6-7)). Khan and Ahamed ([2016](#page-6-8)) implement the DFE using a concurrent look-ahead scheme to reduce the hardware cost. In these approaches the hardware complexity of the DFE increases linearly and the performance of the DFE architecture decreases. To overcome the performance problem, distributed arithmetic architecture (Venkatachalam and Ko [2018](#page-6-9); Khan and Ahamed [2016](#page-6-8);

<sup>1</sup> Department of Electronics Engineering, Osmania University, Hyderabad, India



<span id="page-1-0"></span>

Prakash et al. [2016](#page-6-10)) is introduced in the DFE to mitigate its speed limitation. NagaJyothi and Sridevi ([2019](#page-6-11)) proposed a memory-less distributed arithmetic based FIR filter for the decision feedback equalizer.

There is a growing interest in the distributed arithmetic (DA) architecture in DSP (Yoo and Anderson [2005](#page-6-12); NagaJyothi and SriDevi [2017](#page-6-13); Grande and Sridevi [2017](#page-5-1); Jyothi and Sriadibhatla [2019](#page-5-2)) because it is multiplier-less: it uses only LUTs and a shift-accumulate block to form the partial products. The filter coefficients and input signals of DA are in two's complement or offset binary code form. DA was first introduced by Croisier; its mathematical formulation was later developed by White ([1989](#page-6-14)) and Peled and Liu ([1974](#page-6-15)). Chen gave a RAM-based approach for implementing FIR filters which results in low memory requirements. Memory partitioning and multiple memory banks have been suggested to reduce the memory requirement of DA-based FIR filters. Choi proposed a DA-based FIR filter structure using offset binary coding (OBC) to reduce the memory size by a factor of 2 (Jyothi and Sridevi [2018](#page-5-3)). Yoo proposed a DA-based structure for FIR filters in which the memory size is reduced at the cost of adders. A LUT decomposition scheme has been suggested to reduce the LUT complexity of DA-based FIR filter structures at the cost of a few adders. Several designs have been discussed in past decades to improve the performance and decrease the area of the DA architecture as the filter order increases. DA is an efficient architecture for the FIR filters used in DFEs.

In this paper we propose a memory-less DA based FIR filter and implement it in a concurrent decision feedback equalizer. With the concurrent DFE based architecture, area can be traded for higher throughput or a low-power implementation.

The rest of the paper is organized as follows. Section [2](#page-1-1) describes the mathematical formulation of the block DA based FIR filter. Section [3](#page-2-0) explains the proposed variable DA based block FIR filter. Synthesis results are presented in Sect. [4](#page-4-0). The proposed architecture for the DFE is explained in Sect. [5](#page-4-1). Finally, Sect. [6](#page-5-4) concludes the paper.

# <span id="page-1-1"></span>**2 Formulation of DA based block FIR filter**

Consider a block FIR filter that receives a block of *L* input samples and produces a block of *L* outputs in every cycle. The *k*th block of filter outputs $y_k$ is calculated as:

$$
y_k = X_k \cdot d \tag{1}
$$

The filter coefficient vector *d* is written as:

$$
d = [d(0), d(1), \dots, d(N-1)]^T \tag{2}
$$

The input matrix $X_k$ is formed from the present input block of length *L* and the $(N - 1)$ past samples, and is expressed as:

$$
X_k = \begin{bmatrix} x(kL) & x(kL-1) & \dots & x(kL-N+1) \\ x(kL-1) & x(kL-2) & \dots & x(kL-N) \\ \vdots & \vdots & \ddots & \vdots \\ x(kL-L+1) & x(kL-L) & \dots & x(kL-L-N+2) \end{bmatrix}
$$
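The block formulation of Eq. (1) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the test signal and the convention that samples before the start of the record are zero are all assumptions made for illustration.

```python
def block_input_matrix(x, k, L, N):
    """Return X_k as L rows of N samples; row i, column n holds x(kL - i - n)."""
    def sample(t):
        # Assumption: samples outside the recorded signal are zero.
        return x[t] if 0 <= t < len(x) else 0
    return [[sample(k * L - i - n) for n in range(N)] for i in range(L)]


def block_fir(x, d, k, L):
    """Return [y(kL), y(kL-1), ..., y(kL-L+1)] computed as the product X_k d."""
    X_k = block_input_matrix(x, k, L, len(d))
    return [sum(a * b for a, b in zip(row, d)) for row in X_k]


def direct_fir(x, d, t):
    """Reference output y(t) = sum_n d(n) x(t - n) of an ordinary FIR filter."""
    return sum(d[n] * x[t - n] for n in range(len(d)) if 0 <= t - n < len(x))


# Hypothetical test data: 12 input samples, a 4-tap filter, block size L = 4.
x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]
d = [1, -2, 3, 4]
y_block = block_fir(x, d, k=2, L=4)  # outputs y(8), y(7), y(6), y(5)
```

Each row of $X_k$ is the input vector of one output of the block, so one matrix-vector product yields all *L* outputs of the cycle.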

The input matrix $X_k$ of size $(L \times N)$ is split into $N/2$ matrices $S_k^j$ of size $(L \times 2)$ each, and the filter coefficient vector *d* is split into $N/2$ coefficient vectors $u_j$ of size 2, for $0 \le j \le (N/2) - 1$. The computation of Eq. (1) is then expressed as the sum of $M = N/2$ matrix-vector products:

$$
y_k = \sum_{j=0}^{(N/2)-1} S_k^j u_j \tag{3}
$$

where $S_k^j$ and $u_j$ are expressed as:

$$
u_j = [d(2j), d(2j+1)]^T, \qquad
S_k^j = \begin{bmatrix} x(kL - 2j) & x(kL - 2j - 1) \\ x(kL - 2j - 1) & x(kL - 2j - 2) \\ \vdots & \vdots \\ x(kL - 2j - L + 1) & x(kL - 2j - L) \end{bmatrix}
$$

Each filter output $y(kL - i)$, for $0 \le i \le L - 1$, is given as the sum of $N/2$ inner products:

$$
y(kL - i) = \sum_{j=0}^{N/2 - 1} S_k^{ij} u_j \tag{4}
$$

$S_k^{ij}$, being the $(i+1)$th row of $S_k^j$, is given by

$$
S_k^{ij} = [x(kL - 2j - i) \;\; x(kL - 2j - i - 1)] \tag{5}
$$

The 2-point input vector in Eq. (5) acts as the selection lines for the multiplexers.
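The decomposition into two-tap inner products can be verified against the direct convolution sum. This sketch assumes $u_j = [d(2j), d(2j+1)]^T$, which is how the split of *d* into pairs is read here, and takes samples before the start of the record as zero; all names and data are hypothetical.

```python
def sample(x, t):
    """x(t), with samples outside the record assumed to be zero."""
    return x[t] if 0 <= t < len(x) else 0


def s_row(x, k, L, j, i):
    """S_k^{ij} = [x(kL - 2j - i), x(kL - 2j - i - 1)] as in Eq. (5)."""
    return [sample(x, k * L - 2 * j - i), sample(x, k * L - 2 * j - i - 1)]


def block_output_decomposed(x, d, k, L):
    """Compute y(kL - i), i = 0..L-1, as a sum of N/2 two-tap inner products."""
    N = len(d)
    assert N % 2 == 0, "the pairwise split needs an even filter length"
    u = [[d[2 * j], d[2 * j + 1]] for j in range(N // 2)]  # assumed u_j
    outputs = []
    for i in range(L):
        acc = 0
        for j in range(N // 2):
            s = s_row(x, k, L, j, i)
            acc += s[0] * u[j][0] + s[1] * u[j][1]
        outputs.append(acc)
    return outputs


def direct_output(x, d, t):
    """Reference y(t) = sum_n d(n) x(t - n)."""
    return sum(d[n] * sample(x, t - n) for n in range(len(d)))


x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]
d = [1, -2, 3, 4]
y = block_output_decomposed(x, d, k=2, L=4)  # y(8), y(7), y(6), y(5)
```

Because each inner product involves only two input samples, its possible contributions per bit position are limited to four values, which is what allows a 4:1 multiplexer to replace a multiplier.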

# <span id="page-2-0"></span>**3 Proposed DA based FIR filter for DFE**

To explain the idea of register sharing, consider the input sample vectors of an FIR filter of length $N = 4$ for the computation of one block of four filter outputs $y(n)$, $y(n-1)$, $y(n-2)$, $y(n-3)$. The input vectors required to compute these outputs are $\{x(n), x(n-1), x(n-2), x(n-3)\}$, $\{x(n-1), x(n-2), x(n-3), x(n-4)\}$, $\{x(n-2), x(n-3), x(n-4), x(n-5)\}$ and $\{x(n-3), x(n-4), x(n-5), x(n-6)\}$, respectively. The input vectors of successive filter outputs overlap by three samples. Because of this overlap, only seven of the sixteen samples are distinct; the other nine are repetitions of these seven. The repeated samples need not be stored separately and can be sourced by sharing register contents. In block processing, one block of $x(n-3)$, $x(n-2)$, $x(n-1)$ and $x(n)$ is received in each clock cycle. Hence, of the seven distinct samples, only three need to be held in registers to generate all sixteen samples of the four input vectors. The block architecture of the FIR filter of length four therefore needs three registers, the same number as the FIR filter structure of the same length without block processing. The register complexity of the block structure is thus independent of the block size, which is an important design feature of the block structure. The arithmetic resources, in contrast, must grow in proportion to the block size in a fixed-coefficient FIR filter. Because of the register saving, the area complexity of the block-based FIR filter architecture grows slightly less than proportionally with the block size, so the area-delay efficiency of the hardware architecture is expected to be better for larger block sizes.

To take advantage of this, we propose a block-based DA structure for fixed-coefficient FIR filters. A decomposition factor of $Q = 4$ with a 16-word ROM is considered in deriving the variable-coefficient DA based FIR architecture.
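The register-sharing count described above can be reproduced with a few lines of code. This is only an illustrative sketch: `register_sharing` is a hypothetical helper, and it assumes that the *L* newest samples arrive with the current block while every older sample must come from a register.

```python
def register_sharing(N, L):
    """Count samples used by one block of L outputs of an N-tap FIR filter.

    Offsets are delays m such that output y(n - i) reads x(n - m).
    Returns (total samples read, distinct samples, samples held in registers).
    """
    vectors = [[i + n for n in range(N)] for i in range(L)]
    total = sum(len(v) for v in vectors)
    distinct = sorted({m for v in vectors for m in v})
    # Assumption: offsets 0..L-1 are the current input block; older offsets
    # are the past samples that have to be stored.
    stored = [m for m in distinct if m >= L]
    return total, len(distinct), len(stored)


counts_block_4 = register_sharing(N=4, L=4)  # the example discussed in the text
counts_block_8 = register_sharing(N=4, L=8)  # a larger block, same filter
```

For $N = 4$ and $L = 4$ the helper returns 16 samples read, 7 distinct and 3 stored; doubling the block size to 8 leaves the stored count at 3, in line with the claim that register complexity does not depend on the block size.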

The architecture of the bit-parallel variable-coefficient FIR filter is shown in Fig. [2](#page-2-1). It consists of one bit-slice generator, one partial product generator (PPG) unit, one register array, one partial product selector (PPS) unit, one adder-tree unit (ATU) and one shift-add tree (SAT). The bit-slice generator consists of $(N - 1)$ $B$-bit registers. The PPG unit consists of $M(2^Q - Q - 1)$ adders, where $M = N/Q$. The register array consists of $M(2^Q - 1)$ $B_0$-bit registers. The PPS unit consists of $BM$ multiplexers of size $(2^Q : 1)$, each implemented using $(2^Q - 1)$ multiplexers of size 2:1. Therefore, the PPS unit involves $BM(2^Q - 1)$ 2:1 multiplexers. The ATU consists of $B$ adder trees (ATs) of $\frac{N}{Q}$ words each. Similarly, the SAT comprises $(B - 1)$ adders and the same number of shifters. Therefore, the bit-parallel DA-based variable-coefficient FIR structure involves $BM(2^Q - 1)$ 2:1 multiplexers of bit-width $B$ each.
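The component counts listed above can be tallied programmatically. The sketch below rests on assumptions not stated in the text: that $M = N/Q$, and that an adder tree summing $M$ words uses $M - 1$ adders. The choice $Q = 2$ in the usage lines is also an assumption, made because it reproduces the "Existing design" column of Table 1 for $B = 8$.

```python
def bit_parallel_da_resources(N, B, Q):
    """Tally adders, registers and 2:1 multiplexers of the bit-parallel structure."""
    M = N // Q                            # assumption: M = N/Q LUT partitions
    ppg_adders = M * (2**Q - Q - 1)       # partial product generator
    atu_adders = B * (M - 1)              # assumption: B trees of M - 1 adders
    sat_adders = B - 1                    # shift-add tree
    registers = (N - 1) + M * (2**Q - 1)  # bit-slice generator + register array
    mux_2to1 = B * M * (2**Q - 1)         # partial product selector
    return {
        "adders": ppg_adders + atu_adders + sat_adders,
        "registers": registers,
        "mux_2to1": mux_2to1,
    }


existing_n16 = bit_parallel_da_resources(N=16, B=8, Q=2)
existing_n32 = bit_parallel_da_resources(N=32, B=8, Q=2)
```

With these assumptions the tally gives 71 adders, 39 registers and 192 multiplexers for $N = 16$, and 143, 79 and 384 for $N = 32$.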

The LUT uses multiplexers to select the LUT values according to the address bits applied to the multiplexer select lines. As the LUT size increases, the multiplexer complexity increases, which increases both the area complexity and the critical path of the DA structure. To solve these problems, a variable block DA based FIR filter is needed.
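The dependence of LUT size on the number of address bits follows from the standard two's complement form of DA. The symbols $d_n$, $x_n$, $b_{n,l}$ and $F_l$ below are notation introduced here for illustration, with $x_n$ a $B$-bit two's complement sample and $b_{n,l}$ its $l$th bit:

$$
y = \sum_{n=0}^{Q-1} d_n x_n, \qquad x_n = -b_{n,B-1}\,2^{B-1} + \sum_{l=0}^{B-2} b_{n,l}\,2^{l}
$$

$$
y = -2^{B-1} F_{B-1} + \sum_{l=0}^{B-2} 2^{l} F_l, \qquad F_l = \sum_{n=0}^{Q-1} d_n\, b_{n,l}
$$

Each $F_l$ depends only on the $Q$ address bits $b_{0,l}, \dots, b_{Q-1,l}$, so it takes one of $2^Q$ values. These values form the LUT, the address bits drive the select lines of a $(2^Q : 1)$ multiplexer, and the outer sum is the shift-add step. Doubling $Q$ therefore squares the number of LUT words, which is the growth in multiplexer complexity described above.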

The variable block DA based FIR filter is shown in Fig. [3](#page-3-0). It consists of one delay unit, one PPG block, one register-LUT unit, and $L$ FIR blocks. The PPG block contains $\frac{N}{2}$ adders; it receives a set of coefficient vectors and computes $\frac{N}{2}$ sets of partial products to update the register-LUT unit for a



<span id="page-2-1"></span>**Fig. 2** Block diagram of bit parallel variable coefficient FIR filter

<span id="page-3-0"></span>**Fig. 3** Variable coefficient DA based block FIR filter



particular filter. The register-LUT contains $\frac{3N}{2}$ registers. The delay unit receives a set of inputs $x_k$ and generates $L$ input vectors $x_i^k$ of length $N$ each. Figure [4](#page-3-1) shows the internal architecture of the delay block. In every cycle, the $L$ FIR blocks receive

$L$ input vectors $x_i^k$ from the delay unit and $(N/2)$ sets of three partial product values $rr_j$ from the register-LUT unit, and generate $L$ parallel filter outputs $y_k$. Figure [5](#page-4-2) shows the internal architecture of the variable block FIR filter. The architecture



<span id="page-3-1"></span>**Fig. 4** Block diagram of delay unit

<span id="page-4-2"></span>

consists of $\frac{N}{2}$ multiplexer blocks, $B$ adder tree (AT) blocks and one SAT block. Each multiplexer block contains $B$ 4:1 multiplexers that have the 2-point input vector $S_k^{ij}$ as selection lines. It receives one set of partial product values $rr_j$ from the register-LUT block and retrieves $B$ partial filter outputs in parallel, corresponding to the $B$ bit-slices of the 2-point input vector. Therefore, all $(N/2)$ multiplexer blocks receive $(N/2)$ input vectors from $x_i^k$ and $(N/2)$ sets of partial inner-product values $rr_j$ from the register-LUT unit, and retrieve $(BN/2)$ partial filter outputs in parallel. These $(BN/2)$ partial filter outputs are added through the $B$ ATs to produce $B$ partial filter outputs, which are shift-added in the SAT to obtain the block of $L$ filter outputs. The proposed design receives a set of $L$ inputs in every clock cycle and generates a block of $L$ filter outputs.
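The data flow through the multiplexer blocks, adder trees and shift-add tree can be mimicked in software. The following is a behavioural sketch under stated assumptions, not a description of the synthesized hardware: the three stored values per pair are taken to be $d(2j)$, $d(2j+1)$ and their sum (the fourth multiplexer input being zero), inputs are $B$-bit two's complement integers, and all function names are hypothetical.

```python
def sample(x, t):
    """x(t), with samples outside the record assumed to be zero."""
    return x[t] if 0 <= t < len(x) else 0


def bit(v, l, B):
    """Bit l of the B-bit two's complement code of the integer v."""
    return ((v & ((1 << B) - 1)) >> l) & 1


def da_block_fir(x, d, k, L, B):
    """Multiplier-less computation of y(kL - i), i = 0..L-1."""
    N = len(d)
    # Register-LUT contents: three partial products per coefficient pair.
    rr = [(d[2 * j], d[2 * j + 1], d[2 * j] + d[2 * j + 1]) for j in range(N // 2)]
    outputs = []
    for i in range(L):                      # one FIR block per output
        partial = []
        for l in range(B):                  # one adder tree per bit-slice
            acc = 0
            for j in range(N // 2):         # one multiplexer block per pair
                a = bit(sample(x, k * L - 2 * j - i), l, B)
                b = bit(sample(x, k * L - 2 * j - i - 1), l, B)
                # 4:1 multiplexer: select 0, d(2j), d(2j+1) or their sum.
                acc += (0, rr[j][0], rr[j][1], rr[j][2])[a + 2 * b]
            partial.append(acc)
        # Shift-add tree: the sign bit-slice carries the weight -2^(B-1).
        y = sum(p << l for l, p in enumerate(partial[:-1]))
        y -= partial[-1] << (B - 1)
        outputs.append(y)
    return outputs


def direct_fir(x, d, t):
    """Reference y(t) = sum_n d(n) x(t - n) using ordinary multiplications."""
    return sum(d[n] * sample(x, t - n) for n in range(len(d)))


x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]  # fits in B = 8 signed bits
d = [1, -2, 3, 4]
y_da = da_block_fir(x, d, k=2, L=4, B=8)
```

The only arithmetic inside the loops is selection, addition and shifting, which is the reason the structure needs no multipliers.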

# <span id="page-4-0"></span>**4 Result analysis**

The proposed and existing DA based FIR filter designs are implemented in Verilog HDL, and Synopsys Design Compiler is used to obtain the ASIC implementation results. The proposed design is synthesized in SAED 90 nm technology. The synthesis results show that the proposed structure for the variable-coefficient FIR filter involves 71% less ADP and 65% less EDP than the existing similar structures. A theoretical comparison shows that the proposed fixed-coefficient structure, for block size 8 and filter length 32, involves eight times more ROM words, eight times more adders, two fewer registers, and offers eight times higher throughput rate with the same cycle period than the existing one. For the same block size and filter length, the proposed variable-coefficient structure involves 7.2 times more adders, the same number of registers, eight times more MUXes, and offers eight times higher throughput rate than

<span id="page-4-3"></span>**Table 1** Hardware and time complexities of the proposed variable coefficient DA-based FIR filter for  $B=8$ 

| Structure          | Existing design, $N = 16$           | Existing design, $N = 32$           | Proposed design, $N = 16$           | Proposed design, $N = 32$           |
|--------------------|-------------------------------------|-------------------------------------|-------------------------------------|-------------------------------------|
| Adders             | 71                                  | 143                                 | 260                                 | 524                                 |
| Registers          | 39                                  | 79                                  | 39                                  | 79                                  |
| 2:1 MUX            | 192                                 | 384                                 | 768                                 | 1536                                |
| Cycle period ($T$) | $2T_{\text{mux}} + T_A + 5T_{FA}$ | $2T_{\text{mux}} + T_A + 6T_{FA}$ | $2T_{\text{mux}} + T_A + 5T_{FA}$ | $2T_{\text{mux}} + T_A + 6T_{FA}$ |
| Throughput         | $1/T$                               | $1/T$                               | $4/T$                               | $4/T$                               |

<span id="page-4-4"></span>**Table 2** Synopsys synthesis results for proposed variable coefficient DA-based FIR filter using 90 nm technology



the existing design, as shown in Tables [1](#page-4-3) and [2](#page-4-4). Figures [6](#page-5-5) and [7](#page-5-6) show the area delay product and power delay product of the proposed and existing architectures.
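The hardware counts of the proposed block structure can be cross-checked against Table 1 and the ratios quoted above. The formulas in this sketch are inferred from the architectural description rather than given by the authors: $N/2$ PPG adders, $L$ FIR blocks each with $B$ adder trees of $N/2 - 1$ adders and a $(B - 1)$-adder SAT, $(N - 1)$ delay-unit registers (an assumption) plus $3N/2$ register-LUT registers, and three 2:1 multiplexers per 4:1 multiplexer.

```python
def proposed_block_da_resources(N, B, L):
    """Tally adders, registers and 2:1 multiplexers of the block DA structure."""
    pairs = N // 2
    adders_per_block = B * (pairs - 1) + (B - 1)   # adder trees + shift-add tree
    adders = pairs + L * adders_per_block          # PPG adders + L FIR blocks
    registers = (N - 1) + 3 * pairs                # delay unit (assumed) + register-LUT
    mux_2to1 = L * B * pairs * 3                   # each 4:1 MUX as three 2:1 MUXes
    return {"adders": adders, "registers": registers, "mux_2to1": mux_2to1}


# Block size 4 matches the 4/T throughput of Table 1 (B = 8).
proposed_n16 = proposed_block_da_resources(N=16, B=8, L=4)
proposed_n32 = proposed_block_da_resources(N=32, B=8, L=4)

# Block size 8 and filter length 32, compared with the existing-design
# figures taken from Table 1 (143 adders, 384 multiplexers).
proposed_l8 = proposed_block_da_resources(N=32, B=8, L=8)
adder_ratio = round(proposed_l8["adders"] / 143, 1)
mux_ratio = proposed_l8["mux_2to1"] / 384
```

Under these assumptions the tally reproduces the "Proposed design" column of Table 1 for block size 4, and for block size 8 it gives an adder ratio of about 7.2 and a multiplexer ratio of 8 with an unchanged register count.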

# <span id="page-4-1"></span>**5 Implementation of proposed DA based block FIR filter for DFE**

The decision feedback equalizer (DFE) is well suited and power efficient for channels with a few dominant post-cursor ISI terms; however, the power can become prohibitive for



<span id="page-5-5"></span>**Fig. 6** ADP of the proposed and existing variable DA based FIR filter



<span id="page-5-6"></span>**Fig. 7** PDP of the proposed and existing variable DA based FIR filter

channels with many post-cursor ISI terms. The basic block diagram of the proposed DA based DFE is shown in Fig. [8](#page-5-7). It consists of a feed-forward (FF) filter, a feedback (FB) filter and a decision device. The FF and FB filters are designed using the proposed DA based block FIR filter; that is, the proposed variable DA based FIR filter is used as both the feed-forward and the feedback filter of the concurrent DFE. The ISI errors present in the transmitted signals can be nullified by the proposed design: the ISI errors are nullified by the feed-forward filter and the other noise is removed by the feedback filter.

The proposed design has been implemented in Matlab Simulink and the Xilinx System Generator tool. To exercise the design, consider a set of channel impulse response signals modulated with a message signal using BPSK. To remove the noise and ISI errors from the generated signal, the signal is passed to the adaptive DFE. The pre-cursor, anti-causal part of the ISI is removed by the FF filter block, and the other noise and error signals are removed by the FB filter block. The adaptive DFE operates until the decision device

<span id="page-5-7"></span>

<span id="page-5-8"></span>**Fig. 9** Bit error rate for proposed concurrent DFE

of the adaptive DFE propagates a zero value. The bit error rate calculations are shown in Fig. [9](#page-5-8).
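The role of the feedback filter can be illustrated with a small software model. This is a simplified sketch and not the Simulink/System Generator design: it uses fixed rather than adaptive coefficients, a noise-free hypothetical channel `h` with two post-cursor taps, a trivial one-tap feed-forward filter, and a fixed bit pattern chosen so that the result is deterministic.

```python
def bpsk(bits):
    """Map bits {0, 1} to BPSK symbols {-1, +1}."""
    return [1 if b else -1 for b in bits]


def channel(symbols, h):
    """Pass symbols through an FIR channel with impulse response h (no noise)."""
    return [
        sum(h[n] * symbols[t - n] for n in range(len(h)) if t - n >= 0)
        for t in range(len(symbols))
    ]


def dfe(received, ff, fb):
    """Fixed-coefficient DFE: slice FF output minus FB-filtered past decisions."""
    decisions = []
    for t in range(len(received)):
        forward = sum(ff[n] * received[t - n] for n in range(len(ff)) if t - n >= 0)
        feedback = sum(
            fb[m] * decisions[t - 1 - m] for m in range(len(fb)) if t - 1 - m >= 0
        )
        decisions.append(1 if forward - feedback >= 0 else -1)  # decision device
    return decisions


bits = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1] * 10
tx = bpsk(bits)
h = [1.0, 0.8, 0.5]                  # hypothetical channel: two post-cursor taps
rx = channel(tx, h)

# Assumption: the feedback taps equal the known post-cursor channel taps.
decided = dfe(rx, ff=[1.0], fb=h[1:])
ber = sum(a != b for a, b in zip(tx, decided)) / len(tx)

# For comparison: a plain slicer without feedback on the same received signal.
sliced = [1 if r >= 0 else -1 for r in rx]
slicer_errors = sum(a != b for a, b in zip(tx, sliced))
```

In this noise-free setting the DFE recovers every symbol, while the plain slicer makes errors whenever the two preceding symbols oppose the current one. In the proposed design the `ff` and `fb` filters would be the DA based block FIR filters and the coefficients would be adapted rather than fixed.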

# <span id="page-5-4"></span>**6 Conclusion**

In this paper, we carried out a complexity analysis of the variable-coefficient DA based FIR filter to study the effect of the LUT decomposition factor on the DA design. Based on the complexity analysis, we proposed the full-parallel block-based DA structure for variable-coefficient DA FIR structure using . The proposed architecture processes one block of $L$ input samples and produces one block of $L$ outputs in every clock cycle.

## **References**

- <span id="page-5-0"></span>Barry, J. R., Lee, E. A., & Messerschmitt, D. G. (2012). *Digital communication*. New York: Springer.
- <span id="page-5-1"></span>Grande, N. J., & Sridevi, S. (2017). ASIC implementation of shared LUT based distributed arithmetic in FIR filter. In *Proceedings of the 2017 International Conference on Microelectronic Devices, Circuits and Systems (ICMDCS)* (pp. 1–4). IEEE.
- <span id="page-5-2"></span>Jyothi, G. N., & Sriadibhatla, S. (2019). ASIC implementation of low power, area efficient adaptive FIR filter using pipelined DA. In *Proceedings of the Microelectronics, Electromagnetics and Telecommunications* (pp. 385–394). Springer.
- <span id="page-5-3"></span>Jyothi, G. N., & Sridevi, S. (2018). Low power, low area adaptive finite impulse response filter based on memory less distributed arithmetic. *Journal of Computational and Theoretical Nanoscience*, *15*(6–7), 2003–2008.
- <span id="page-6-8"></span>Khan, M. T., & Ahamed, S. R. (2016). Low cost implementation of concurrent decision feedback equalizer using distributed arithmetic. In *Proceedings of the 2016 1st India International Conference on Information Processing (IICIP)* (pp. 1–5). IEEE.
- <span id="page-6-3"></span>Lin, C. H., Wu, A. Y., & Li, F. M. (2006). High-performance VLSI architecture of decision feedback equalizer for gigabit systems. *IEEE Transactions on Circuits and Systems II: Express Briefs*, *53*(9), 911–915.
- <span id="page-6-5"></span>Lin, C. S., Lin, Y. C., Jou, S. J., & Shiou, M. T. (2007). Concurrent digital adaptive decision feedback equalizer for 10GBASE-LX4 Ethernet system. In *Custom Integrated Circuits Conference, 2007. CICC'07* (pp. 289–292). IEEE.
- <span id="page-6-2"></span>Lin, Y. C., Jou, S. J., & Shiue, M. T. (2012). High throughput concurrent lookahead adaptive decision feedback equaliser. *IET Circuits, Devices & Systems*, *6*(1), 52–62.
- <span id="page-6-13"></span>NagaJyothi, G., & SriDevi, S. (2017). Distributed arithmetic architectures for FIR filters-a comparative review. In *Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET)* (pp. 2684–2690). IEEE.
- <span id="page-6-11"></span>NagaJyothi, G., & Sridevi, S. (2019). High speed and low area decision feed-back equalizer with novel memory less distributed arithmetic filter. *Multimedia Tools and Applications*, *78*, 1–15.
- <span id="page-6-4"></span>Oh, D., & Parhi, K. K. (2006). Low complexity design of high speed parallel decision feedback equalizers. In *Proceedings of the ASAP'06. International Conference on Application-specific Systems, Architectures and Processors* (pp. 118–124). IEEE.
- <span id="page-6-1"></span>Parhi, K. K. (2004). Pipelining of parallel multiplexer loops and decision feedback equalizers. In *Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP'04)*, vol. 5 (pp. V–21). IEEE.
- <span id="page-6-6"></span>Parhi, K. K. (2005). Design of multigigabit multiplexer-loop-based decision feedback equalizers. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, *4*, 489–493.
- <span id="page-6-7"></span>Parhi, K. K., & Messerschmitt, D. G. (1989). Pipeline interleaving and parallelism in recursive digital filters. I. Pipelining using scattered look-ahead and decomposition. *IEEE Transactions on Acoustics, Speech, and Signal Processing*, *37*(7), 1099–1117.
- <span id="page-6-15"></span>Peled, A., & Liu, B. (1974). A new hardware realization of digital filters. *IEEE Transactions on Acoustics, Speech, and Signal Processing*, *22*(6), 456–462.
- <span id="page-6-10"></span>Prakash, M. S., Shaik, R. A., & Koorapati, S. (2016). An efficient distributed arithmetic-based realization of the decision feedback equalizer. *Circuits, Systems, and Signal Processing*, *35*(2), 603–618.
- <span id="page-6-0"></span>Proakis, J. G., & Manolakis, D. G. (1996). *Digital signal processing* (Vol. 3). New Jersey: Prentice Hall.
- <span id="page-6-9"></span>Venkatachalam, S., & Ko, S. B. (2018). Approximate sum-of-products designs based on distributed arithmetic. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, *99*, 1–5.
- <span id="page-6-14"></span>White, S. A. (1989). Applications of distributed arithmetic to digital signal processing: A tutorial review. *IEEE ASSP Magazine*, *6*(3), 4–19.
- <span id="page-6-12"></span>Yoo, H., & Anderson, D. V. (2005). Hardware-efficient distributed arithmetic architecture for high-order digital filters. In *Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05)* (Vol. 5, pp. 5–125).