Pipelined VLSI Architecture using CORDIC for Transform Domain Equalizer

Banerjee, Ayan; Dhar, Anindya Sundar

doi:10.1007/s11265-012-0657-7

Pipelined VLSI Architecture using CORDIC for Transform Domain Equalizer

Published: 17 February 2012

Volume 70, pages 39–48, (2013)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Journal of Signal Processing Systems Aims and scope Submit manuscript

Pipelined VLSI Architecture using CORDIC for Transform Domain Equalizer

Download PDF

Ayan Banerjee¹ &
Anindya Sundar Dhar¹

646 Accesses
11 Citations
Explore all metrics

Abstract

In this paper, a pipelined architecture using CORDIC for realization of transform domain equalizer is presented. Transform domain equalizer has much faster convergence than its time domain counterpart for practical hardware realization having nonzero adaptation delay. Here running DFT is employed as the transform, and CORDIC is used for realization of running DFT. Pipelining is applied throughout the architecture, thus limiting the critical path delay to the propagation delay of a single 16 bit adder for 16 bit arithmetic. For N tap equalizer, primary clock speed is N times of the sample clock speed, so that on arrival of each sample, the computation of whole transform and weight update is possible. In the proposed architecture, hardware complexity is reduced by fully utilizing the pipeline without using parallel structures. The adaptation delay is only 2 sample clock periods resulting in fast convergence. The proposed architecture is suitable for VLSI implementation with primary clock speed limited by the binary adder propagation delay which could be as low as 2 ns in the present state-of-the-art technology.

High-Throughput VLSI Architectures for CRC-16 Computation in VLSI Signal Processing

High-Throughput VLSI Architectures for VLSI Signal Processing

An Optimized VLSI Implementation of the Least Mean Square (LMS) Adaptive Filter Architecture on the Basis of Distributed Arithmetic Approach

Article 15 March 2024

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In a communication system, an adaptive equalizer is employed to minimize the intersymbol interference by adaptively trying to mimic the inverse characteristics of that of communication channel. Least mean square (LMS) algorithm has been among the most popular ones to be utilized for the process of adaptive equalization [1]. The speed of convergence for the LMS algorithm depends on various factors such as input data statistics (e.g. eigenspread of the covariance) and stepsize parameter etc. It has been established [2] that, in a given practical situation, transform domain LMS (TXLMS) converges faster than the conventional counterpart. Here delayed LMS algorithm should be used, as delayless LMS cannot be physically realized due to inherent delay in hardware resources. However, computational cost increases considerably in TXLMS due to its requirement for evaluating associated transform. Thus it becomes important to realize efficient architecture that can implement TXLMS based adaptive equalizer, which could be used in a functional unit in a practical communication system.

This paper describes an architecture for TXLMS based adaptive equalizer where the transform concerned is employed through running discrete Fourier transform (DFT) [3] which operates on both real and imaginary data. The proposed architecture employs Co-ordinate Rotation Digital Computer (CORDIC) [4] unit in a pipelined configuration, as the primary functional block for computing running DFT, thereby saving the usage of conventional bulky multipliers. However, in the weight updating section, pipelined multipliers have been employed. Here running DFT acts as a sliding window, making it possible to update the weights in the transform domain on arrival of each input sample. In the proposed architecture, hardware economy is achieved by exploiting the pipelining technique to the maximum extent without using any parallel structure. Excepting the size of the RAM, hardware requirement for the proposed architecture is invariant to the equalizer tap length.

2 Theoretical Groundwork

For a transmitted signal s(t), in a communication system, the received signal r(t) is given by

$$ r(t) = s(t) \otimes h(t) $$

(1)

where h(t) is the channel impulse response and ⊗ denotes the convolution operation. Now, if the received signal is passed through an equalizer, its output y(t) is given by

$$ y(t) = r(t) \otimes q(t) $$

(2)

where q(t) is the impulse response of the equalizer. The responsibility of the equalizer is to adopt q(t) in such a way that the originally transmitted signal s(t) can be recovered from y(t) by a simple decision making process. This is possible in a digital communication system, since the transmitted signal can have discrete values representing symbols from a specified constellation having finite number of points in it. The equalizer output can be quantized to map to the correct (as transmitted) constellation point provided that the error is contained such that the decision boundary remains distinct.

A conventional N-tap time domain equalizer uses N number of weights to adaptively attain the required impulse response, as closely as possible, within the scope of the adaptation algorithm. Let V(n) = [v ₀(n) v ₁(n) …….. v _N−1(n)] be the values of the N weights at discrete time index n and R(n) = [r(n) r(n − 1) …….. r(n − (N − 1))] be the present and past received signals appearing at the same time index n, at the so called tapped delay line which is nothing but a FIFO like register array.

The output of the time domain equalizer is given by

$$ y(n) = V(n){{R}^T}(n) $$

(3)

The error in estimating the actually transmitted signal s(n) is given by

$$ e(n) = s(n) - y(n) $$

(4)

In the LMS algorithm, the error e(n) is utilized to update the weights as per the following expression:

$$ V\left( {n + 1} \right) = V(n) + \mu {{R}^{*}}(n)e(n) $$

(5)

where μ is the stepsize parameter that governs the speed of convergence and * denotes complex conjugate. Choosing the value of μ beyond certain limit may lead to instability and divergence.

In the transform domain equalizer [5] of length N, as shown in Fig. 1, a suitable orthogonal transformation is applied to the input vector R(n) made up of a block of N input samples such that the transformed output vector (having N elements) U(n) is given by

$$ U(n) = R(n)F(n) $$

(6)

where F(n) is a unitary matrix of dimension N × N representing the chosen transform.

The transformed output will now be combined by a set of N weights, say W(n) = [w ₀(n) w ₁(n) …….. w _N−1(n)] to form the final output as follows:

$$ y(n) = W(n){{U}^T}(n) $$

(7)

The error in estimating the transmitted signal s(n), in this case also, is given by Eq. 4. The weights are updated by following an approach similar to Eq. 5:

$$ W\left( {n + 1} \right) = W(n) + \mu \prime {{U}^{*}}(n)e(n) $$

(8)

where μ′ is the stepsize parameter for transform domain LMS algorithm.

In a practical hardware realization of the equalizer, while resorting to pipelined architecture for supporting high data rate, a delay is inadvertently introduced in updating the weights with respect to the input signal. Under such cases, to incorporate the effect of this delay D, the governing equations are getting slightly modified [6] as follows:

$$ \begin{gathered} y\left( {n - D} \right) = V\left( {n - D} \right){{R}^T}\left( {n - D} \right) \hfill \\ e\left( {n - D} \right) = s\left( {n - D} \right) - y\left( {n - D} \right) \hfill \\ V\left( {n + 1} \right) = V(n) + \mu {{R}^{*}}\left( {n - D} \right)e\left( {n - D} \right) \hfill \\ \end{gathered} $$

(9)

This delay causes an offset between the calculated weights and the samples on which the weights act to produce the output, thereby resulting in slow convergence compared to the case where there is no computational delay. However, this does not pose a serious problem in achieving the final objective of equalizing the channel, provided that the signal statistics do not change by a large extent within the time span of such weight updating delay.

In this circumstance, transform domain equalization helps in accelerating the convergence process. After accommodating the delay, the set of governing equations for transform domain equalization becomes:

$$ y\left( {n - D} \right) = W\left( {n - D} \right){{U}^T}\left( {n - D} \right) $$

(10a)

$$ e\left( {n - D} \right) = s\left( {n - D} \right) - y\left( {n - D} \right) $$

(10b)

$$ W\left( {n + 1} \right) = W(n) + \mu \prime {{U}^{*}}\left( {n - D} \right)e\left( {n - D} \right) $$

(10c)

Running DFT is chosen for our implementation of the transform domain equalizer and the former is computed with the help of CORDIC arithmetic module.

In the running DFT for N point transform length, U _i(k), the k-th frequency component at i-th time instant, can be updated as follows:

$$ U_{{i + 1}} {\left( k \right)} = {\left[ {U_{i} {\left( k \right)} - R{\left( {i - N + 1} \right)} + R{\left( {i + 1} \right)}} \right]}e^{{2\pi jk/N}} $$

(11)

where R(i + 1) is the (i + 1)-th data sample. Multiplication by the factor e ^2πjk/N, i.e., complex rotation is carried out by CORDIC operation.

In the CORDIC technique, the plane rotation through target angle α is done by decomposing it into several elementary angles and rotating through each of these angles as follows:

$$ \alpha = \sum\limits_{{i = 0}}^{{M - 1}} {{{\sigma }_i}{{\theta }_i}} {\text{ where }}{{\theta }_i} = {{\tan }^{{ - 1}}}\left( {{{2}^{{ - i}}}} \right) $$

(12)

with M being the wordlength and σ _i = +1 or −1, deciding the direction of rotation. Plane rotation is achieved by pipelined operations as follows:

$$ \matrix{{*{20}{c}} {{{X}_{{i + 1}}} = {{X}_i} - {{\sigma }_i}{{Y}_i}{{2}^{{ - i}}}} \hfill \\ {{{Y}_{{i + 1}}} = {{\sigma }_i}{{X}_i}{{2}^{{ - i}}} + {{Y}_i}} \hfill \\ {{{Z}_{{i + 1}}} = {{Z}_i} - {{\sigma }_i}{{\theta }_i}{\text{ with }}{{Z}_0} = \alpha, {{\sigma }_i} = {\text{Sign}}\left( {{{Z}_i}} \right)} \hfill \\ } $$

(13)

If operations in M number of pipelined stages are carried out as in Eq. 13, then the end results are scaled version of actual results (X _act, Y _act) as follows:

$$ \matrix{{*{20}{c}} {{{X}_M} = {{K}_M}{{X}_{{act}}}} \hfill \\ {{{Y}_M} = {{K}_M}{{Y}_{{act}}}{\text{ where }}{{K}_M} = 1/\prod\limits_{{i = 1}}^{{M - 1}} {\cos {{\theta }_i}} } \hfill \\ } $$

(14)

3 Architecture Design

The proposed architecture for the transform domain equalizer is shown in Fig. 2. Here tap length N equals 32 and 16 bit internal arithmetic (M) is chosen. On arrival of each complex input (R _n), the running DFT block computes N frequency components as outputs. The N complex outputs of the running DFT block are fed successively to one input of complex pipelined multiplier, the other input is coming from the weight block. In the complex multiplier, N complex weight values are multiplied with corresponding transformed values from running DFT block. The N complex outputs from multiplier are accumulated to generate output as in Eq. 10a.

The output is then subtracted from the desired response to generate the error signal as in Eq. 10b. The weight block is a dual port RAM along with two address generators, where initially first set of weight values is written. When weight from a particular location, say (n − i) location, is read out for multiplication to generate y(n), at the same time weight updating operation is going on at first by reading weight from (n − i − k) location, keeping an offset from the other read operation, and updated weight after operation as per Eq. 10c is written back to the same (n − i − k) location. Thus two weights are simultaneously accessed from two different address locations, with a fixed offset between them. The error signal is fed to another pipelined multiplier where complex conjugate operation is done. The other input to the pipelined multiplier is running DFT output after passing through a proper synchronization delay. The output of the conjugate multiplier is multiplied by stepsize μ′ through a variable right shifter as μ′ can be expressed here in the form 2⁻ⁱ. Then it is added with previous weight, thus performing operation as per Eq. 10c to generate updated weight. The decision block is generating final output by taking decision as per structure of constellation points. Here the primary clock speed is f _max = 1/T _delay(adder) and the operations such as pipelined multiplications, accumulation operations to get y(n), weight updating, running DFT internal operations are by primary clock and for N transform length the maximum sampling clock speed is f _max/N. The adaptation delay corresponds here to a maximum of two sampling clocks.

The main processing unit in this architecture is the running DFT block of transform length N that equals 32. Here the updating operation for a particular frequency component, as shown in Eq. 11, can be achieved through only an addition, a subtraction and a complex rotation amounting to 2πk/N for the k-th frequency component. Now the complex rotation is performed with the help of pipelined CORDIC module whose functionality is described in Section 3.A. The architecture of running DFT, shown in Fig. 3, processes real and imaginary data in two similar sections each comprising of two RAM blocks, adders, subtractors, scale blocks, two data registers, and pipelined CORDIC block, which is common for both the sections. Here the primary clock speed is N times the sample clock speed so that on arrival of each sample, all N frequency components can be computed. The input RAM blocks (1RX and 1RY) are single port RAM, which can store N words of 16-bit length. The data is written in the input RAM and initially all the flip-flops and counters are cleared to zero. The general scheme for subtraction of (i–N)-th data from i-th data is as follows. Since the RAM can store N data, so the address of writing i-th data is the same where previously (i–N)-th data is written. For a given address, first the previously stored data is latched in data register (1DX(Y)) then the new data is written in the same location in the RAM. The latched data is subtracted from the new data by subtractor to obtain (R(i + 1) − R(i–N + 1)). For each input arrived at sample clock rate (fixed i), N different rotation angles (different k values) are fed to the CORDIC block successively at primary clock rate so that rotation corresponding to all N angles is achieved and N updated frequency components (real and imaginary) are available at the output in successive clock cycles after the latency period of the CORDIC processor. After the rotation operation, CORDIC outputs (real & imaginary) are fed to the scale blocks (X and Y) for multiplying it by a factor $ \prod\limits_{{i = 0}}^{{15}} {Cos\theta i} = 0.{6}0{7252935} $, to get the actual value of the rotated output. After the scaling operation is done, the outputs are written in the dual port RAM blocks (2RX and 2RY). The data written corresponding to a particular frequency is to be read out again after the arrival of the next input data during the computation of the corresponding frequency component. This is done by keeping an offset between read and write addresses of the dual port RAM to cover the latency of the scale and CORDIC block. RAM block outputs are fed to adder, the other input of which is the new subtracted data (R(i + 1) – R(i–N + 1)). This subtracted data remaining same as one of the input of the adder, whereas the other inputs are the N frequency components corresponding to the previous input. The outputs of adder are fed back to the CORDIC block for the next set of rotation operations. The clock in the input RAM is the sample clock, which is obtained by a division of N of the primary clock, which is used in the other portion of the running DFT block.

3.1 Pipelined CORDIC Architecture

The CORDIC operation is done here through a pipelined structure, as shown in Fig. 4. There are three inputs at a time, real data (X), imaginary data (Y) and target angle (α). The total rotation is achieved here by decomposing the target angle α in elementary angles tan⁻¹(2⁻ⁱ), i starting from 0 to 15, and also in clockwise or anticlockwise direction for a particular i, as required. But as per normalized angle representation scheme [7], value of tan⁻¹(2⁻¹³) is equal to weight of LSB only, so rotation angle is decomposed up to this value. For rotation of each of these 14 angles, we have to perform three addition/subtraction operations along with right shifting (in two cases) as per Eq. 13. Since the amount of shift is fixed for a particular stage, it can be realized by bus cross connection only, without deploying additional hardware. The operation proceeds through stages in a pipelined structure, each stage performing a defined amount of rotation by taking data set from its previous stage, and new data set is entering in very first stage in each clock cycle. However, at the time of loading the target angle in the very first stage, the corresponding register copies MSB − 1 bit as the MSB instead of the original value of MSB which has got an weightage equals to −π. This is done so as to keep the target angle α within the range [−π/2, π/2]. If the target angle is beyond that limit, i.e. if it lies either in the second or in the third quadrant, the sign changing of the output variables X and Y are to be carried out to incorporate the reflections about the axes, as described in [7]. The controlled sign change for the two outputs are done through controlled two’s complementers. The pipe fills in 15 clock cycles, which defines its latency and then we get one set of outputs (X and Y) in every clock cycle, which defines its throughput. The maximum combinatorial delay in this CORDIC block for 16 bit internal arithmetic is delay of a 16 bit adder, which also defines its maximum clock speed as f _max = 1/T _delay(adder)

3.2 Scale Block Architecture

The scale block is designed for a fixed amount of multiplication of the input and it is also implemented in a pipelined structure. Let us consider the structure of the scale block, shown in Fig. 5, which multiplies the input by the factor 0.607252935. So here the relation between the input and the output is as follows: dout = din × 0.607252935 = din.(2⁻¹ + 2⁻³ − 2⁻⁶ − 2⁻⁹ − 2⁻¹² + 2⁻¹⁴), assuming 16 bit accuracy. Here the addition/subtraction operation is done in three stages. In the first stage, input din is right shifted by appropriate amounts (through RSH blocks by bus cross connection only), and then necessary additions/subtractions are carried out in this stage and also in the second and third stage, to get final output. This scale block is a pipeline structure and the operation proceeds here stage by stage in successive clock cycles. So the latency of the scale block is 3 clock cycles and throughput is one output per clock cycle. The maximum combinatorial delay in this scale block for 16 bit internal arithmetic is delay of a 16 bit adder which also defines its maximum clock speed as f _max = 1/T _delay(adder).

3.3 Pipelined Multiplier Architecture

There are two multipliers used in this design, one is complex and the other is complex conjugate multiplier, each input is 16-bit signed binary. The basic module is 8-bit × 8-bit unsigned binary multiplier, which is shown in Fig. 6. There are four addition operations done in parallel in level1 (L1) by taking the AND-ed inputs (a_ib_j) as shown below:

Four 8-bit adders ADDER0 to ADDER3 are used in level1 pipelining, whose outputs are latched in REGISTER0 to REGISTER3. Here S_ij is the j-th sum bit of i-th adder and C_i is the carry output of i-th adder. In level2 (L2) pipelining, two parallel addition operations are done by taking required portions of output from level1 by two 10-bit adders ADDER4 and ADDER5 and latched in REGISTER4 and REGISTER5. In each of ADDER4 and ADDER5, the output is limited to 10 sum-bit since for one of the inputs, MSB and MSB − 1 are always zero and thus no final carry bit is generated. In level3 (L3) pipelining, one addition operation is done by taking required portions of output from level2 by one 12-bit adder, ADDER6 and latched in REGISTER6. In ADDER6 also, output is limited to 12-bit and no final carry is generated. Final 16-bit output is taken from REGISTER6. This basic module is used to realize 16 × 16 bit signed binary multiplier. In complex multiplier, used in the equalizer, complex numbers (a + jb) and (c + jd) are multiplied to generate output (ac − bd) + j(ad + bc) using four multipliers, one adder and one subtractor as shown in Fig. 7. In realization of complex conjugate multiplier, role of adder and subtractor is interchanged. The latency for this basic module is 3 primary clock cycles and throughput is one per clock cycle and critical path delay is limited to the propagation delay of a 16-bit adder/subtractor.

4 Results and Discussion

The novelty of the proposed architecture lies in utilizing running DFT for transform domain equalization. The running DFT architecture processes both real and imaginary data at its input. This architecture is fully pipelined with throughput of one set of real and imaginary data per clock cycle and on arrival of each new input, it computes all 32 frequency components for 32 point running DFT operation in 32 clock cycles. The running DFT architecture, reported in [8], processes only real data at its input and throughput is 20 clock cycles per output and computes only 1 frequency component for single carrier and 8 frequency components for multi carrier systems. The adaptation delay for the transform domain equalizer is 2 sample clock cycles in the architecture realized here. The MSE convergence rate for time domain delayed LMS and transform domain LMS taking an adaptation delay of 2 clock cycles are compared with LMS (without delay), as shown in Fig. 8, having stepsize values 0.00012, 0.00001 and 0.0005 respectively for 64 point QAM input having P and Q values {−7, −5, −3, −1, 1, 3, 5, 7}. The stepsize values are so chosen that the convergence and stability are guaranteed satisfying the constraints as described in [5]. Since the hardware realization for delayless LMS equalizer is not feasible in practice, the transform domain equalizer has got much faster convergence rate than its time domain counterpart. The critical path delay in all the blocks used in the equalizer architecture is limited to the propagation delay of a 16 bit binary adder/subtractor for 16 bit internal arithmetic. The sample clock speed is limited by f _primary/32 for 32 tap equalizer, and f _primary/N for N tap equalizer and the adaptation delay is 2 sample clock cycles for N tap equalizer and thus same in terms of sample clock for any N. The hardware utilization of various blocks in terms of 16 bit adder, data register and RAM bit is shown in Table 1. Thus with increasing tap length only RAM size is increasing, other hardware requirements remaining same. The comparison between various architectures is shown in Table 2. The proposed architecture requires less hardware resource, than as in [9, 10], without sacrificing convergence rate. Proposed architecture is suitable for implementation in FPGA and ASIC, and critical path delay may be as low as 2 ns as reported in [11].

Table 1 Hardware utilization of proposed architecture.

Full size table

Table 2 Comparison between various architectures.

Full size table

References

Qureshi, S. U. H. (1985). Adaptive equalization. Proceedings of the IEEE, 73(9), 1349–1387.
Article Google Scholar
Narayan, S. S., & Peterson, A. M. (1981). Frequency domain LMS algorithm. Proceedings of the IEEE, 69(1), 124–126.
Article Google Scholar
Liu, J.-C., & Lin, T.-P. (1988). Running DHT and real-time DHT analyzer. Electronics Letters, 24(12), 762–763.
Article MathSciNet Google Scholar
Haviland, G., & Tuszynski, A. (1980). A CORDIC arithmetic processor chip. IEEE Transactions on Computers, C-29(2), 68–79.
Article Google Scholar
Narayan, S. S., Peterson, A. M., & Narasimha, M. J. (1983). Transform domain LMS algorithm. IEEE Transactions on Acoustics, Speech, & Signal Processing, ASSP-31(3), 609–615.
Article Google Scholar
Long, G., Ling, F., & Proakis, J. (1989). The LMS algorithm with delayed coefficient adaptation. IEEE Transactions on Acoustics, Speech, & Signal Processing, 37(9), 1397–1405.
Article MATH Google Scholar
Banerjee, A., Dhar, A. S., & Banerjee, S. (2001). FPGA realization of a CORDIC based FFT processor for biomedical signal processing. Microprocessors and Microsystems, 25(3), 131–142.
Article Google Scholar
Banerjee, A., & Dhar, A. S. (2005). Novel architecture for QAM modulator-demodulator and its generalization to multicarrier modulation. Microprocessors and Microsystems, 29(7), 351–357.
Article Google Scholar
Santha, K. R., & Vaidehi, V. (2004). Design of synchronous and asynchronous architectures for DFT based adaptive equalizer (pp. 383–389). Proc. IEEE Conf. SoutheastCon, Greensboro, NC, USA, 26–29 Mar.
Glentis, G., & Georgoulakis, K. (2006). Pipelined architectures for the frequency domain linear equalizer. International Journal of Applied Mathematics and Computer Science, 16(4), 525–535.
MATH Google Scholar
Yu, Z., Yu, M. -L., & Willson, A. N. Jr. (2001). Signal representation guided synthesis using carry-save adders for synchronous data-path circuits (pp. 456–461). Proc. 38th Design Automation Conf., Las Vegas, NV, USA, 18–22 June.

Download references

Author information

Authors and Affiliations

Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, 721302, India
Ayan Banerjee & Anindya Sundar Dhar

Authors

Ayan Banerjee
View author publications
You can also search for this author in PubMed Google Scholar
Anindya Sundar Dhar
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ayan Banerjee.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Banerjee, A., Dhar, A.S. Pipelined VLSI Architecture using CORDIC for Transform Domain Equalizer. J Sign Process Syst 70, 39–48 (2013). https://doi.org/10.1007/s11265-012-0657-7

Download citation

Received: 26 June 2009
Revised: 20 March 2010
Accepted: 17 January 2012
Published: 17 February 2012
Issue Date: January 2013
DOI: https://doi.org/10.1007/s11265-012-0657-7

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Pipelined VLSI Architecture using CORDIC for Transform Domain Equalizer

Abstract

Similar content being viewed by others

High-Throughput VLSI Architectures for CRC-16 Computation in VLSI Signal Processing

High-Throughput VLSI Architectures for VLSI Signal Processing

An Optimized VLSI Implementation of the Least Mean Square (LMS) Adaptive Filter Architecture on the Basis of Distributed Arithmetic Approach

1 Introduction

2 Theoretical Groundwork