1 Introduction

One of the major obstacles in digital communication systems (Barry et al. 2012; Proakis and Manolakis 1996) is channel distortion caused by inter-symbol interference (ISI). As the data rates of transmission systems increase, high-speed DSP systems are needed. The speed of DSP systems is limited to about 1 GHz, so new architectures are needed to operate at multi-gigabit data rates. Pipelining and parallel processing are two schemes that have been successfully employed to increase the computation speed of digital communication systems (Parhi 2004). Distortions such as ISI can be reduced by introducing equalizers (Lin et al. 2012, 2006; Oh and Parhi 2006) at the receiving side of the transmission system. Equalizers fall into two classes: linear and non-linear. Figure 1 shows the classification of equalizers. Non-linear equalizers perform better than linear equalizers. The decision feedback equalizer (DFE) and the maximum likelihood sequence detector (MLSD) are two prominent non-linear techniques used to combat ISI errors. The DFE improves the signal-to-noise ratio (SNR) by removing the post-cursor ISI at the output, and it also eliminates the noise amplification caused by spectral nulls. Despite these advantages, the DFE poses a severe speed problem: its speed is limited by the iteration bound of the feedback filter (FBF). Several techniques have been proposed in the past to enhance the speed and reduce the hardware cost. In Lin et al. (2006) and Lin et al. (2007), the authors designed a DFE architecture that removes the feedback loop and introduces multiplexers and a comparator to enhance the speed and decrease the chip area; later, Parhi (2005) extended the work with parallelization and unfolding techniques to reach the target of gigabit systems. In (Lin et al. 2006; Parhi and Messerschmitt 1989), the authors described a DFE based on look-ahead pipelined multiplexer loops. In (Khan and Ahamed 2016), the authors implemented the DFE using a concurrent look-ahead scheme to alleviate the hardware cost. However, the hardware complexity of the DFE increases linearly and the performance of the DFE architecture decreases. To overcome this performance problem, distributed arithmetic architectures (Venkatachalam and Ko 2018; Khan and Ahamed 2016; Prakash et al. 2016) have been introduced to mitigate the speed limitation of the DFE. In (NagaJyothi and Sridevi 2019), the authors proposed a memory-less distributed arithmetic based FIR filter for the decision feedback equalizer.

Fig. 1 Classification of equalizers

There is currently a growing interest in DSP in the distributed arithmetic (DA) architecture (Yoo and Anderson 2005; NagaJyothi and SriDevi 2017; Grande and Sridevi 2017; Jyothi and Sriadibhatla 2019) because it is multiplier-less: it uses only LUTs and a shift-accumulate block to compute the partial products. The filter coefficients and input signals of DA are represented in two’s complement or offset binary code form. DA was first introduced by Croisier; the mathematical formulation of DA was later developed by White (1989) and Peled and Liu (1974). Chen presented a RAM-based approach for implementing FIR filters that results in low memory requirements. Memory partitioning and multiple memory bank approaches have been suggested to reduce the memory requirement of DA-based FIR filters. Choi proposed a DA-based FIR filter structure using offset binary coding (OBC) to reduce the memory size by a factor of 2 (Jyothi and Sridevi 2018). Yoo proposed a DA-based structure for FIR filters in which the memory size is reduced at the cost of additional adders. A LUT decomposition scheme has been suggested to reduce the LUT complexity of DA-based FIR filter structures at the cost of a few adders. Several designs have been discussed in past decades to improve the performance and decrease the area of the DA architecture as the filter order increases. DA is an efficient architecture for the FIR filters used in the DFE.

In this paper, we propose a memory-less DA-based FIR filter and implement it in a concurrent decision feedback equalizer. With the concurrent DFE-based architecture, area can be traded for higher throughput or a low-power implementation.

The rest of the paper is organized as follows: Sect. 2 describes the mathematical formulation of the block DA-based FIR filter. Section 3 explains the proposed variable DA-based block FIR filter. Synthesis results are discussed in Sect. 4. The proposed architecture for the DFE is explained in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Formulation of DA based block FIR filter

Consider a block FIR filter that receives a block of L input samples and produces a block of L outputs in every cycle. The kth block of filter outputs \(y_k\) is calculated as:

$$y_k=X_k \cdot d$$
(1)

The filter coefficient vector d is written as:

$$d=[d(0),d(1),\ldots d(N-1)]^T$$
(2)

The input matrix \(X_k\) is formed from the present input block of length L and the (N − 1) past samples, and is expressed as:

$$\begin{aligned} X_k= \begin{bmatrix} x(kL) & x(kL-1) & \cdots & x(kL-N+1) \\ x(kL-1) & x(kL-2) & \cdots & x(kL-N)\\ \vdots & \vdots & \ddots & \vdots \\ x(kL-L+1) & x(kL-L) & \cdots & x(kL-L-N+2) \end{bmatrix} \end{aligned}$$
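As an illustration of Eq. (1), the following minimal Python sketch builds \(X_k\) row by row and compares the block output with a direct FIR sum. The function names, the test data and the zero-history assumption (samples before x(0) are taken as zero) are ours and are not part of the proposed architecture.

```python
def block_input_matrix(x, k, L, N):
    """Build the L x N matrix X_k; row i holds x(kL-i), ..., x(kL-i-N+1)."""
    def sample(idx):
        # Assumption: samples outside the recorded sequence are zero.
        return x[idx] if 0 <= idx < len(x) else 0.0
    return [[sample(k * L - i - n) for n in range(N)] for i in range(L)]

def block_fir(x, d, k, L):
    """Return y_k = X_k d, i.e. [y(kL), y(kL-1), ..., y(kL-L+1)]."""
    X = block_input_matrix(x, k, L, len(d))
    return [sum(xv * dv for xv, dv in zip(row, d)) for row in X]

def direct_fir(x, d, n):
    """Reference output y(n) = sum_m d(m) x(n-m)."""
    return sum(d[m] * x[n - m] for m in range(len(d)) if 0 <= n - m < len(x))

x = [float(v) for v in range(1, 13)]   # x(0)..x(11), illustrative data
d = [0.5, -1.0, 2.0, 0.25]             # N = 4 illustrative coefficients
L, k = 4, 2
y_block = block_fir(x, d, k, L)
y_direct = [direct_fir(x, d, k * L - i) for i in range(L)]
print(y_block == y_direct)             # True
```

Each row of \(X_k\) is the input vector of one output in the block, so one matrix-vector product yields L outputs per cycle.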

The input matrix \(X_k\) of size \((L\times N)\) is split into N/2 matrices \(S_k^j\) of size \((L\times 2)\) each, and the filter coefficient vector d is split into N/2 coefficient vectors \(u_j\) of size 2, for \(0\le j\le (N/2)-1\). The computation of Eq. (1) is then expressed as the sum of M = N/2 matrix-vector products:

$$y_k=\sum _{j=0}^{(N/2)-1} S_k^j u_j$$
(3)

where \(u_j=[d(2j),d(2j+1)]^T\) and \(S_k^j\) is expressed as:

$$\begin{aligned} S_k^j= \begin{bmatrix} x(kL-2j) & x(kL-2j-1) \\ x(kL-2j-1) & x(kL-2j-2)\\ \vdots & \vdots \\ x(kL-2j-L+1) & x(kL-2j-L) \end{bmatrix} \end{aligned}$$

Each filter output \(y(kL-i)\), for \(0\le i\le L-1\), is given as the sum of N/2 inner products:

$$y(kL-i)=\sum _{j=0}^{(N/2)-1}S_k^{ij} u_j$$
(4)

where \(S_k^{ij}\), the (i+1)th row of \(S_k^j\), is given by

$$S_k^{ij}=[x(kL-2j-i)\;\; x(kL-2j-i-1)]$$
(5)
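The decomposition into 2-tap inner products can be checked numerically. The sketch below is only illustrative: it assumes that \(u_j\) holds the coefficient pair d(2j), d(2j+1), that N is even, and that samples before x(0) are zero. The function names and data are hypothetical.

```python
def sub_matrix(x, k, L, j):
    """S_k^j: row i is [x(kL-2j-i), x(kL-2j-i-1)], following Eq. (5)."""
    def sample(idx):
        # Assumption: samples outside the recorded sequence are zero.
        return x[idx] if 0 <= idx < len(x) else 0
    return [[sample(k * L - 2 * j - i), sample(k * L - 2 * j - i - 1)]
            for i in range(L)]

def block_fir_decomposed(x, d, k, L):
    """Accumulate N/2 products S_k^j u_j to obtain [y(kL), ..., y(kL-L+1)]."""
    N = len(d)                              # assumed even
    y = [0] * L
    for j in range(N // 2):
        u_j = (d[2 * j], d[2 * j + 1])      # assumed coefficient pairing
        S = sub_matrix(x, k, L, j)
        for i in range(L):
            y[i] += S[i][0] * u_j[0] + S[i][1] * u_j[1]
    return y

x = list(range(1, 13))                      # x(0)..x(11), illustrative data
d = [3, -1, 2, 5]                           # N = 4 illustrative coefficients
y = block_fir_decomposed(x, d, k=2, L=4)
print(y)                                    # [63, 54, 45, 36]
```

The printed values equal the direct convolution outputs y(8), y(7), y(6) and y(5), so the N/2 two-tap products reproduce Eq. (1).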

The bits of the 2-point input vector \(S_k^{ij}\) in Eq. (5) serve as the selection lines of the multiplexers.

3 Proposed DA based FIR filter for DFE

To explain the idea of register sharing, the input sample vectors of an FIR filter of length N = 4 are analyzed for the computation of one block of four filter outputs \({y(n-3),y(n-2),y(n-1),y(n)}\). The input vectors required for computing y(n), y(n − 1), y(n − 2) and y(n − 3) are {x(n − 3), x(n − 2), x(n − 1), x(n)}, {x(n − 4), x(n − 3), x(n − 2), x(n − 1)}, {x(n − 5), x(n − 4), x(n − 3), x(n − 2)} and {x(n − 6), x(n − 5), x(n − 4), x(n − 3)}, respectively. The input vectors corresponding to successive filter outputs overlap by three samples. Because of this overlap, only seven of the sixteen samples are distinct, while the other nine are repetitions of these seven. The repeated samples need not be stored separately, since they can be obtained by sharing the register contents. In block processing, one block of samples x(n − 3), x(n − 2), x(n − 1) and x(n) is received during each clock cycle. Consequently, out of the seven distinct samples, only three need to be saved in registers to generate all sixteen samples of the four input vectors. Therefore, the block architecture of the FIR filter of length four needs three registers, which is the same number as needed by the conventional FIR filter structure of the same length. The register complexity of the block structure is thus independent of the block size, which is an important design feature of the block structure. The arithmetic resources need to be increased in proportion to the block size in a fixed-coefficient FIR filter. Because of the register saving, the area complexity of the block-based FIR filter architecture grows marginally less than proportionately with the block size, so the area-delay efficiency of the hardware architecture is expected to be better for larger block sizes. To exploit this, we propose a block-based DA structure for fixed-coefficient FIR filters. A decomposition factor of Q = 4, corresponding to a 16-word ROM, is considered to derive the variable-coefficient DA-based FIR architecture.
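The register-sharing argument for N = 4 and a block of four outputs can be reproduced with a short count. The sketch below is illustrative only; the function names are ours, and sample delays are expressed relative to the newest sample x(n).

```python
def input_vector_delays(N, L):
    """Delays (relative to n) of the samples used by y(n-i), i = 0..L-1."""
    return [[i + m for m in range(N)] for i in range(L)]

def register_count(N, L):
    """Return (total samples, distinct samples, samples that must be stored)."""
    vectors = input_vector_delays(N, L)
    total = sum(len(v) for v in vectors)
    distinct = sorted({delay for v in vectors for delay in v})
    # Delays 0..L-1 arrive with the current block; only older ones need registers.
    stored = [delay for delay in distinct if delay >= L]
    return total, len(distinct), len(stored)

print(register_count(4, 4))   # (16, 7, 3)
print(register_count(4, 8))   # (32, 11, 3)
```

In both cases three registers suffice, which matches the observation that the register count depends on the filter length and not on the block size.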

The architecture of the bit-parallel variable-coefficient FIR filter is shown in Fig. 2. It consists of one bit-slice generator, one partial product generator (PPG) unit, one register array, one partial product selector (PPS) unit, one adder-tree unit (ATU) and one shift-add tree (SAT). The bit-slice generator consists of \((N - 1)\) B-bit registers. The PPG unit consists of \(M(2^Q - Q- 1)\) adders. The register array consists of \(M(2^Q - 1)\) registers of \(B_0\) bits each. The PPS unit consists of BM multiplexers of size \((2^Q:1)\). Each \((2^Q:1)\) multiplexer is implemented using \((2^Q- 1)\) 2:1 multiplexers. Therefore, the PPS unit involves \(BM(2^Q- 1)\) 2:1 multiplexers. The ATU consists of B adder trees (ATs) of \(\frac{N}{Q}\) words each. Similarly, the SAT comprises (B − 1) adders and the same number of shifters. Therefore, the bit-parallel DA-based variable-coefficient FIR structure involves \(BM(2^Q- 1)\) 2:1 multiplexers of bit-width B each.
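The component counts above can be tabulated for a given design point. The following sketch simply evaluates the stated expressions; the function name, the dictionary keys and the example values (N = 16, B = 8, Q = 4, and M = 4, chosen here under the assumption M = N/Q) are illustrative and are not taken from the synthesis results.

```python
def bit_parallel_da_resources(N, B, M, Q):
    """Evaluate the component-count expressions of the bit-parallel structure."""
    return {
        "bit_slice_registers": N - 1,
        "ppg_adders": M * (2**Q - Q - 1),
        "register_array_registers": M * (2**Q - 1),
        "pps_large_multiplexers": B * M,            # each of size 2^Q:1
        "pps_2to1_multiplexers": B * M * (2**Q - 1),
        "adder_trees": B,
        "sat_adders": B - 1,
        "sat_shifters": B - 1,
    }

# Hypothetical design point; M = N / Q is an assumption made for this example.
counts = bit_parallel_da_resources(N=16, B=8, M=4, Q=4)
for name, value in counts.items():
    print(f"{name}: {value}")
```

For this example the multiplexer count reaches 480 2:1 multiplexers, which illustrates how quickly the PPS unit grows with the decomposition factor Q.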

The LUT uses a multiplexer to select the LUT values according to the address bits available at the multiplexer select lines. As the LUT size increases, the multiplexer complexity increases, which increases both the area complexity and the critical path of the DA structure. To solve these problems, a variable block DA-based FIR filter is needed.

Fig. 2 Block diagram of bit parallel variable coefficient FIR filter

The variable block DA-based FIR filter is shown in Fig. 3. It consists of one delay unit, one PPG block, one register-LUT unit and L FIR blocks. The PPG block contains \(\frac{N}{2}\) adders; it receives a set of coefficient vectors and computes \(\frac{N}{2}\) sets of partial products to update the register-LUT unit for a particular filter. The register-LUT unit contains \(\frac{3N}{2}\) registers. The delay unit receives a set of inputs \(x_k\) and generates L input vectors \(x_i^k\) of length N each. Figure 4 shows the internal architecture of the delay unit. In every cycle, the L FIR blocks receive L input vectors \(x_i^k\) from the delay unit and (N/2) sets of three partial product values \(rr_j\) from the register-LUT unit, and generate L parallel filter outputs \(y_k\). Figure 5 shows the internal architecture of the variable block FIR filter. The architecture consists of \(\frac{N}{2}\) multiplexer blocks, B adder-tree (AT) blocks and one SAT block. Each multiplexer block contains B 4:1 multiplexers that use the 2-point input vector \(S_k^{ij}\) as selection lines; it receives one set of partial product values \(rr_j\) from the register-LUT unit and retrieves B partial filter outputs in parallel, corresponding to the B bit-slices of the 2-point input vector. Therefore, all (N/2) multiplexer blocks receive (N/2) input vectors from \(x_i^k\) and (N/2) sets of partial inner-product values \(rr_j\) from the register-LUT unit, and retrieve (BN/2) partial filter outputs in parallel. These (BN/2) partial filter outputs are added through the B ATs to produce B partial filter outputs, which are shift-added in the SAT to obtain the filter output; the L FIR blocks together yield the block of L filter outputs. The proposed design thus receives a set of L inputs in every clock cycle and generates a block of L filter outputs.
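The data flow of one FIR block can be mimicked in software. The sketch below is a behavioural model under our own assumptions, not the synthesized design: inputs are B-bit two's complement integers, each coefficient pair contributes the three partial products d(2j), d(2j+1) and d(2j)+d(2j+1), and the bit-slices are processed in a loop rather than in parallel. All function names are hypothetical.

```python
def partial_products(pair):
    """Register-LUT contents for one coefficient pair (one adder per pair)."""
    d0, d1 = pair
    return (d0, d1, d0 + d1)

def mux4(rr, s0, s1):
    """4:1 multiplexer: selection bits pick 0, d1, d0 or d0 + d1."""
    d0, d1, d01 = rr
    return (0, d1, d0, d01)[2 * s0 + s1]

def bit(value, b, B):
    """Bit b of the B-bit two's complement representation of value."""
    return ((value & ((1 << B) - 1)) >> b) & 1

def da_output(samples, coeffs, B):
    """Inner product of samples and coeffs using multiplexers and shift-adds."""
    half = len(coeffs) // 2                       # N/2 multiplexer blocks
    rr = [partial_products(coeffs[2 * j:2 * j + 2]) for j in range(half)]
    acc = 0
    for b in range(B):
        # Adder tree for bit-slice b: sum of the N/2 multiplexer outputs.
        at = sum(mux4(rr[j], bit(samples[2 * j], b, B),
                      bit(samples[2 * j + 1], b, B)) for j in range(half))
        # Shift-add stage: the sign bit carries a negative weight.
        weight = -(1 << b) if b == B - 1 else (1 << b)
        acc += weight * at
    return acc

result = da_output([3, -5, 7, -8], [2, -3, 4, 1], B=4)
print(result)   # 41
```

The result equals 3·2 + (−5)·(−3) + 7·4 + (−8)·1 = 41, obtained without any multiplier; only selections, additions and shifts are used.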

Fig. 3 Variable coefficient DA based block FIR filter

Fig. 4 Block diagram of delay unit

Fig. 5 Internal architecture of variable coefficient DA based block FIR filter

4 Result analysis

The proposed and existing designs of the DA-based FIR filter are implemented in Verilog HDL, and Synopsys Design Compiler is used to obtain the ASIC implementation results. The proposed design is synthesized in SAED 90 nm technology. The synthesis results show that the proposed structure for the variable-coefficient FIR filter involves 71% less area-delay product (ADP) and 65% less energy per sample (EPS) than the existing similar structures. A theoretical comparison shows that the proposed fixed-coefficient structure, for block size 8 and filter length 32, involves eight times more ROM words, eight times more adders and two fewer registers, and offers an eight times higher throughput rate with the same cycle period than the existing design. For the same block size and filter length, the proposed variable-coefficient structure involves 7.2 times more adders, the same number of registers and eight times more MUXes, and offers an eight times higher throughput rate than the existing design, as shown in Tables 1 and 2. Figures 6 and 7 show the area-delay product and the power-delay product (PDP) of the proposed and existing architectures.

Table 1 Hardware and time complexities of the proposed variable coefficient DA-based FIR filter for B=8
Table 2 Synopsys synthesis results for proposed variable coefficient DA-based FIR filter using 90nm technology
Fig. 6 ADP of the proposed and existing variable DA based FIR filter

Fig. 7 PDP of the proposed and existing variable DA based FIR filter

5 Implementation of proposed DA based block FIR filter for DFE

The decision feedback equalizer (DFE) is well suited and power efficient for channels with a few dominant post-cursor ISI terms; however, its power consumption can become prohibitive for channels with many post-cursor ISI terms. The basic block diagram of the proposed DA-based DFE is shown in Fig. 8. It consists of a feed-forward (FF) filter, a feedback (FB) filter and a decision device. The FF and FB filters are designed using the proposed DA-based block FIR filter; that is, the proposed variable DA-based FIR filter is used as both the feed-forward and the feedback filter of the concurrent DFE. The ISI errors present in the transmitted signal can be nullified using the proposed design: the ISI errors are nullified by the feed-forward filter and other noise is removed by the feedback filter.

Fig. 8 Concurrent DFE

The proposed design has also been implemented in MATLAB Simulink and Xilinx System Generator. To evaluate the proposed design, consider a set of channel impulse response signals modulated with the message signal using BPSK. To remove noise and ISI errors from the generated signal, the signal is passed to the adaptive DFE. The pre-cursor (anti-causal) part of the ISI is removed by the FF filter block, and other noise and error signals are removed by the FB filter block. The adaptive DFE operates until its decision device propagates a zero value. The bit error rate results are shown in Fig. 9.
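The basic operation of a DFE on BPSK data can be illustrated with a few lines of Python. This is a simplified sketch under our own assumptions and not the Simulink or System Generator model: the channel taps are hypothetical, the FF and FB taps are fixed rather than adapted, the FB taps are set equal to the post-cursor channel taps, and the channel is noise-free in the example run.

```python
import random

def bpsk(bits):
    """Map bits {0, 1} to BPSK symbols {-1, +1}."""
    return [1.0 if b else -1.0 for b in bits]

def channel(symbols, h, noise_std, rng):
    """Causal FIR channel h followed by additive Gaussian noise."""
    out = []
    for n in range(len(symbols)):
        isi = sum(h[m] * symbols[n - m] for m in range(len(h)) if n - m >= 0)
        out.append(isi + rng.gauss(0.0, noise_std))
    return out

def dfe(received, ff_taps, fb_taps):
    """Feed-forward filter, feedback of past decisions, then a sign slicer."""
    decisions = []
    for n in range(len(received)):
        ff = sum(ff_taps[m] * received[n - m]
                 for m in range(len(ff_taps)) if n - m >= 0)
        fb = sum(fb_taps[m] * decisions[n - 1 - m]
                 for m in range(len(fb_taps)) if n - 1 - m >= 0)
        decisions.append(1.0 if ff - fb >= 0 else -1.0)
    return decisions

def bit_error_rate(sent, detected):
    """Fraction of symbols detected incorrectly."""
    return sum(1 for s, d in zip(sent, detected) if s != d) / len(sent)

rng = random.Random(1)
symbols = bpsk([rng.randint(0, 1) for _ in range(500)])
h = [1.0, 0.7, 0.5]                       # hypothetical post-cursor ISI channel
rx = channel(symbols, h, noise_std=0.0, rng=rng)
ber_dfe = bit_error_rate(symbols, dfe(rx, ff_taps=[1.0], fb_taps=h[1:]))
ber_slicer = bit_error_rate(symbols, dfe(rx, ff_taps=[1.0], fb_taps=[]))
print(ber_dfe, ber_slicer)
```

With the feedback taps matched to the channel, the post-cursor ISI is cancelled and the noise-free run is error-free, whereas the slicer without feedback makes errors because the ISI exceeds the symbol amplitude. A nonzero `noise_std` can be used to obtain a bit error rate curve.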

Fig. 9 Bit error rate for proposed concurrent DFE

6 Conclusion

In this paper, we carried out a complexity analysis of the variable-coefficient DA-based FIR filter to study the effect of the LUT decomposition factor on the DA design. Based on the complexity analysis, we proposed a full-parallel block-based DA structure for the variable-coefficient DA FIR filter using . The proposed architecture processes one block of L input samples and produces one block of L outputs in every clock cycle.