1 Introduction

A rotation is a multiplication by a complex number whose magnitude is one, i.e., a transformation that describes a circular movement with respect to a point [7, 11, 30, 32]. The rotation of a complex number (x + iy) by an angle α can be defined as:

(1)

where X and Y are the real and imaginary parts of the result, respectively. Thus, the rotation can be written as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} X + iY = (x \cos \alpha - y \sin \alpha) + i (y \cos \alpha + x \sin \alpha). \end{array} \end{aligned} $$
(2)

Let C and S be representations of \(\cos \alpha \) and \(\sin \alpha \) with a finite number of bits. A rotation of complex number, i.e., a rotation in complex plane, can be presented as

(3)

where X Q and Y Q are the result of the rotation, which includes the error resulting from quantization of the coefficients. Many digital signal processing algorithms require to carry out rotations of complex numbers by the given angles with respect to the origin. This is the case with fast Fourier transforms (FFT) [26], discrete cosine transforms, and lattice filters [23].

2 Rotations in FFT

The fast Fourier transform is one of the most important tools in digital signal processing. It is based on discrete Fourier transform and used to convert time domain signals to frequency domain [3, 26, 28]. The Fourier transform is defined for continuous signals, while the modern signal processing mainly considers digital systems, and the discrete Fourier transform (DFT) is used instead. The DFT transforms a finite sequence of equally spaced samples to a corresponding frequency domain representation as follows:

$$\displaystyle \begin{aligned} X{[}k{]}=\sum_{n=0}^{N-1} x[n] W_N^{kn}, \;\;\;\;\;\;\;\; k=0,1\ldots,N-1, \end{aligned} $$
(4)

where N denotes the length of DFT, i.e., the number of points of the DFT, x[n] and X[k] are the input and output samples, respectively. Note that both the signals are discrete in nature. The complex-valued coefficients W N are called as twiddle factors and are defined as

$$\displaystyle \begin{aligned} W_N = e^{-j 2\pi / N} = \cos\left( 2\pi / N \right) - j \sin\left( 2\pi / N \right), \end{aligned} $$
(5)

where j denotes the imaginary unit.

The original signal x[n] can be recovered from X[k] with the aid of an inverse discrete Fourier transform (IDFT):

$$\displaystyle \begin{aligned} x[n]=\frac{1}{N}\sum_{k=0}^{N-1} X[k]W_N^{kn}, \;\;\;\;\;\;\;\; k=0,1, \ldots, N-1. \end{aligned} $$
(6)

The arithmetic complexity of the DFT in (4) is O(N 2). However, DFT contains redundant computations and several methods have been introduced for avoiding such a redundancy, thus reducing the complexity. Any algorithm computing DFT with less than O(N 2) complexity is called as a fast Fourier transform. The most popular FFT is the Cooley–Tukey algorithm [3], which uses divide-and-conquer paradigm to decompose DFT into a set of smaller DFTs. Especially, the Cooley–Tukey principle says that an N-point DFT (N = PQ) can be computed with the aid of a P-point DFT and a Q-point DFT. By exploiting the periodicity of the twiddle factors:

$$\displaystyle \begin{aligned} W_N^{k Q }=W_{P}^{k}; \ W_N^{kP}=W_{Q}^{k}; \ W_N^{kPQ}= 1, \end{aligned}$$

the radix-Q FFT can be expressed as:

$$\displaystyle \begin{aligned} {X}(Qk_1+k_2)&=\sum_{n_1=0}^{P-1}\sum_{n_2=0}^{Q-1} x\left(n_1+Pn_2\right)W_P^{n_1k_1}W_N^{n_1k_2} W_{Q}^{n_2k_2} ~~~~~~~~~~~~~~~~~\notag \\ &=\underbrace {\sum_{n_1=0}^{P-1}\left[\underbrace {\left( \sum_{n_2=0}^{Q-1}x(n_1+Pn_2) W_{Q}^{n_2k_2} \right ) }_{Q\text{-}point~DFT}\underbrace {W_N^{n_1k_2}}_{twiddle factor} \right] W_P^{n_1k_1}}_{P\text{-}point~DFT}. \end{aligned} $$
(7)

The two summations indexed by n 1 and n 2 are referred to as inner and outer DFTs. As a result, an N-point DFT is broken down into P-point and Q-point DFTs. The output of the inner DFT is multiplied by \(W_N^{n_1k_2}\), which is called as a twiddle factor multiplication. The scheme of decomposition is shown in Fig. 1, where the left and right sides represent the P-point inner DFT and Q-point outer DFT, respectively. Between those DFTs, a twiddle factor multiplication indicates a rotation by \(W_N^\phi = e^{-j\frac {2\pi }{N}\phi }\). The arithmetic complexity is reduced from O(N 2) of the DFT in (4) to \(O(N\log N)\).

Fig. 1
figure 1

Decomposition scheme of Cooley–Tukey FFT algorithm

If the length N is not a prime, the Cooley–Tukey principle can be applied iteratively and then the DFT is computed with the aid of several smaller DFTs. In particular, if the DFT length is a power of a prime, i.e., N = P q, then the N-point DFT can be computed with the aid of P-point DFTs constructed in q computing stages. As the resulting fast algorithm contains only P-point DFTs, it is called as a radix-P FFT.

The most popular approach is radix-2 FFT algorithm, where the DFT is decomposed recursively until the entire algorithm is computed with the aid of two-point DFTs. The advantage is that that the two-point DFT can be computed with trivial twiddle factors, thus multiplications can be avoided. Another popular algorithm is radix-4 FFT as twiddle factors contain only 1, − 1, j, and − j, thus again no multiplications are needed.

The recursive application of Cooley–Tukey principle can be done by starting from the time domain sequence, which results in a decimation-in-time (DIT) algorithm as shown in Fig. 2. In a similar fashion, the decomposition process can be started from the frequency domain sequence resulting in a decimation-in-frequency (DIF) algorithm as shown in Fig. 3. The numbers between each stage represent the rotations. Each of these values (ϕ) corresponds to a rotation by the twiddle factor [6]. It should be noted that \(W^2_{16}=W^1_{8}; W^4_{16}=W^1_{4}\).

Fig. 2
figure 2

16-point radix-2 DIT FFT flow graph. The number between the stages indicates rotation, i.e., value ϕ in twiddle factor multiplication \(W^{\phi }_{16}\)

Fig. 3
figure 3

16-point radix-2 DIF FFT flow graph. The number between the stages indicates rotation, i.e., value ϕ in twiddle factor multiplication \(W^{\phi }_{16}\)

2.1 Twiddle Factors

Each input of an FFT stage is rotated by a different angle, which is determined by the twiddle factor \(W_L^{\phi }= e^{-j2\pi \phi /L}\). The parameter L is a constant for each stage and determines the number of possible rotations in that stage. These angles are based on the division of the circumference into L equal parts. The exact angle is determined by the parameter ϕ, which is a natural number {0, …, L − 1} [7, 11, 30, 32].

The complexity of the rotator is determined by the value of L. A small value is desirable, because it results in a simple twiddle factor architecture. An example of twiddle factors W 4, W 8, and W 16 is shown in Fig. 4.

Fig. 4
figure 4

Twiddle factors: (a) \(W_4 (W^0_4, W^1_4, W^2_4, W^3_4 )\), (b) \(W_8 (W^0_8, W^1_8, \ldots , W^7_8 )\), and (c) \(W_{16} (W^0_{16}, W^1_{16}, \ldots , W^{15}_{16} )\)

The simplest rotator is W L = W 4, which includes only trivial rotations (0, \(\frac {\pi }{4}\), π, and \(\frac {3\pi }{4}\)). Trivial rotations are characterized by the fact that they can be calculated by simply exchanging the real and imaginary parts of the input and/or changing their sign [11]. Thus, the criterion used to select the algorithm is to minimize the number of large twiddle factors and maximize the number of trivial rotators (W 4), which are the cheapest ones. There are also trade-offs between the number of large (W 64 and larger) and small (W 8, W 16, and W 32) twiddle factors.

3 Rotation in Fixed-Point Arithmetic

Ideally the arithmetic operations in the FFT algorithm should be represented with infinite precision. However, in practice, temporary values are stored in registers with a finite capacity. There exist different ways to approximate real numbers using a finite number of digits. One of the most common ways for embedded systems is two’s complement fixed-point (FxP) arithmetic [5, 27].

For angle α, it is always true that: \(\cos \alpha \in [-1,1] \) and \(\sin \alpha \in [-1,1]\). Therefore, it is assumed that C and S are also numbers in the range [−1, 1] with rounding to a certain number of fractional bits, b. According to the general interpretation of the coefficients, the magnitude of the rotation is always one, independently on the number of bits, b, as it happens in the definition of the rotation.

As C and S contain b bits, they can also be considered integers in two’s complement representation in the range of [−2b−1, 2b−1]. Accordingly, the values of the cosine and sine components of the angle α will be

$$\displaystyle \begin{aligned} \begin{aligned} C &= R ( \cos \alpha + \epsilon_c ); \\ S &= R ( \sin \alpha + \epsilon_s ), {} \end{aligned} \end{aligned} $$
(8)

where 𝜖 c and 𝜖 s are the relative quantization errors of the cosine and sine components, respectively, and R is the scaling factor of the coefficients being the error rotation [9].

4 Rotations in FFT Architectures

Closer investigation of signal flow graphs of FFT algorithms indicates that the rotations in the signal flow graph have some systematic properties, which can be exploited in implementations. The way how these properties can be exploited depends on the mapping from the signal flow graph onto the processing units. The main differences are that how many samples are processed in parallel and how many different twiddle factors each rotator must support. We can identify three principal types:

  1. 1.

    Single Branch with Multiple Rotations. A general scenario is to compute rotations on the data that flow through a single branch, where different pieces of data rotated by different coefficients are shown in Fig. 5a. This scenario is mostly used in single/multiple feedback pipelined and iterative FFT architectures [11].

    Fig. 5
    figure 5

    Rotator scenarios: (a) Type 1, (b) Type 2, and (c) Type 3

  2. 2.

    Multiple Branches, Multiple Rotations for Each Branch. The most typical case consists of several branches with several rotations for each of them, as shown in Fig. 5b. This scenario is part of feed forward pipelined FFT architectures [11].

  3. 3.

    Multiple Branches, Single Rotation for Each Branch. This scenario is applied on fully parallel FFT architecture, where data flow is parallel and each branch may carry out a rotation by a different coefficient [11] as shown in Fig. 5c.

The Type 1 and Type 2 are also referred to as multiple constant rotations (MCR) and Type 3 is called as a single constant rotation (SCR) [11].

5 Rotator Units

As discussed earlier, the FFT signal flow graphs can be mapped on processing elements with different methods indicating different properties for the units. In this section, we discuss the implications to the rotators due to these mappings and identify three different classes: rotators based on general complex multipliers, constant rotators, and CORDIC rotators.

5.1 Rotators Based on General Complex Multiplier

The most popular approach to implement the rotation is to use complex multiplier and lookup table for storing the sine and cosine components of the rotation angle as shown in Fig. 6 [30]. A straightforward method is to implement the complex multiplication with four real multipliers and two adders as depicted in Fig. 7a. The complex multiplication can be realized more efficiently by exploiting the fact that this is a rotation. Such methods reduce the number of real multipliers from four to three [30]. Among them, the most alluring ones are the following:

$$\displaystyle \begin{aligned} \begin{array}{c} X = x (\cos\alpha +\sin\alpha) - (x+y) \cdot \sin\alpha; \\ Y = x (\cos\alpha +\sin\alpha) + (y+x) \cdot \sin\alpha {} \end{array} \end{aligned} $$
(9)

and

$$\displaystyle \begin{aligned} \begin{array}{c} X = (x+y) \cos\alpha - y \cdot (\cos\alpha+\sin\alpha); \\ Y = (x+y) \cos\alpha - x \cdot (\cos\alpha-\sin\alpha). {} \end{array} \end{aligned} $$
(10)

Both these cases consist of a common term in the equations for X and Y  that only need to be computed once. Thus, when (C + S) and (C − S) are precomputed, these cases require three real-valued multiplications and five real-valued additions. The architecture for (9) is shown in Fig. 7b.

Fig. 6
figure 6

Rotation architecture based on complex multiplier

Fig. 7
figure 7

Rotation based on complex multipliers: (a) standard, (b) modified I, and (c) lifting-based

Another approach can be applied only in the case of rotation, which is called lifting-based rotators [2]. The rotation in (1) can be written in matrix form as follows:

(11)

We can apply the lifting approach to the previous equation and the following form is obtained:

(12)

By further computing the matrices, the following equations are obtained:

$$\displaystyle \begin{aligned} \begin{aligned} X &= x - x \sin \alpha \tan \frac{\alpha}{2} + 2y \tan \frac{\alpha}{2} - y \tan^2 \frac{\alpha}{2} \sin \alpha ; \\ Y &= y - x \sin \alpha - y \sin \alpha \tan \frac{\alpha}{2}. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ {} \end{aligned} \end{aligned} $$
(13)

These equations can be realized with the structure depicted in Fig. 7c. This approach requires also three real-valued multiplications and three real-valued additions. However, there is no need for an additional coefficient as it simply replaces the \(\cos \alpha \) with \(\tan \frac {\alpha }{2}\) in memory.

This type of rotator can be applied in all types of FFT architecture scenarios mentioned earlier. Figure 5b and c requires more than one rotation, which are used in signal flow graphs and parallel pipelined FFT architectures, respectively. In those scenarios, the complexity of rotation memory can be reduced by applying appropriate techniques. Rotation memory is used to store the coefficients C and S of rotation. There are a number of techniques to store the coefficients in the memory according to the requirement of FFT architectures [31]. The simplest approach is to store all the coefficients of rotations in the memory without considering any optimization technique as shown in Fig. 8. This results in a large rotations memory especially for large FFTs. It should be noted that this scheme possibly stores the same rotations in several locations as the mapping is from the computing stages of FFT algorithms.

Fig. 8
figure 8

Coefficient memory for storing all rotations of twiddle factor

A possible simplification is to use an address generator that generates the row address for the corresponding angle. As a result, we only need to store the coefficients once in the memory as shown in Fig. 9. For the case L = N, we will need to store many but not all values, still using N possible words even though many can be set as “don’t cares.” Due to this fact, one can expect the resources used to the lookup table to be reduced compared to the previous approach, given that the synthesis tool can benefit from it.

Fig. 9
figure 9

Coefficient memory with address generation

Another modification, proposed in [16], is to use the well-known octave symmetry to only store twiddle factors for \(0~\le ~\phi ~\le ~\frac {\pi }{4}\). The additional cost is an address mapping circuit as discussed in the previous section as well as multiplexers to interchange the real and imaginary parts and possible negations. The main benefit is that only \(\frac {L}{8}+1\) words are required to be stored. The resulting architecture is illustrated in Fig. 10.

Fig. 10
figure 10

Coefficient memory with address generation and octave symmetry circuit

5.2 Constant Rotators

Typically the real-valued multipliers discussed in the previous section can be replaced by shift-and-add multiplication as shown in [15, 22, 39]. This is especially useful when the rotator needs to support only a single rotation angle.

Let us consider a rotation by \(\frac {\pi }{4}\), which requires only one coefficient as \(\cos \frac {\pi }{4}=\sin \frac {\pi }{4}\). Hence, each real, x, and imaginary, y, parts of the operand are multiplied separately by \(\cos \frac {\pi }{4}\) and the multiplier outputs are added and subtracted as depicted in Fig. 11. When an input signal is multiplied by one constant, an optimal single constant multiplier from [13] can be used for the implementation. The internal architecture of a constant multiplier is shown in Fig. 12, where the input operand x is multiplied with a coefficient 145, the result being 145x.

Fig. 11
figure 11

Single constant rotation by \(\cos ( \frac {\pi }{4})\)

Fig. 12
figure 12

Simplified constant multiplication

In terms of complexity, the shift operations are free as they only reduce the number of adders in the implementation of the constant multiplications. A general block diagram of the rotator is shown in Fig. 13. This consists of two constant multiplication blocks by dashed blocks. Each dashed block consists of two constants, C and S, which are implemented by shift-and-add. Hence, both the multipliers sharing the same input can be simultaneously realized using a multiple constant multiplication (MCM) technique [4].

Fig. 13
figure 13

Rotation based on constant multiplication

The constant rotators can be applied for all the FFT rotation scenarios. However, this architecture is no more efficient for large values of L; the higher the value of L, the greater number of constant coefficients are needed implying more complex shift-and-add circuits. As a result, this architecture can be used efficiently for twiddle factors W 8, W 16, and W 32 [11, 25, 32]. The maximum number of rotations in the twiddle factor set is 0, …, L − 1. As discussed earlier, thanks to the symmetries of the angles on the complex plane, for L-point twiddle factor only the L∕8 + 1 angles in the range [0, π∕4] need to be considered. This is due to the fact that the rest of rotations of twiddle factor can be formed from these angles by interchanging the real and imaginary parts of the input and output data and/or the signs of the outputs. The design of constant rotators can be divided into two main parts: the generation of the rotation coefficients and the implementation of shift-and-add circuit.

5.2.1 Rotation Coefficients

In hardware implementations, it is not only desirable to reduce the computation error but also the area on the circuit. As mentioned earlier: \(\cos \alpha \in [-1,1] \) and \(\sin \alpha \in [-1,1]\). Therefore, in general it is assumed that C and S are also numbers in the range of [−1, 1] with rounding a certain number of fractional bits, b. There is a number of methods to reduce the coefficient error without increasing the addition cost.

A method based on searching the coefficients by allowing word length increase by E fractional bits is introduced in [14]. The approximation error can be guaranteed to meet 𝜖 ≤ 2−(N+E+1) with a brute force method by searching coefficients, which fulfill this condition such that the word length increase is at most E fractional bits. Another way of using the additional fractional bits is to realize that there are exactly 2E different representable coefficients for which 𝜖 ≤ 2−(N+1), including the one obtained by rounding to b bits. The basic idea is to search these 2E and select the coefficient value that has the smallest approximation error for allowed complexity called as the addition aware quantization [14]. The allowed complexity is typically assumed to be the same number of additions as required by the coefficient rounded to a coefficient with b fractional bits. This is a generic method and it can be applied on rotation coefficients. Another method, which is well suited especially to rotations, is the so-called combined coefficient selection and shift-and-add implementation (CCSSI) [11]. This method refers to a set of rotations that must be optimized together. This joint optimization happens when there is a dependency on the scaling of the rotations. Different optimization problems can be defined for SCR and MCR depending on the scaling that is required and on the hardware layout. Thus, the scaling can be fixed, unity, or arbitrary depending on the freedom to choose the scaling factor. Unity scaling is a particular case of fixed scaling, where the rotation has magnitude of one or, in more general terms, R = 2b. This is equivalent to considering that the binary point is in a different position in the binary representation. Conversely, arbitrary scaling means that R can take any value, i.e., no restriction is set to R. For arbitrary scaling the approximation error is equal to the angular error only, since R will always take the optimal value. However, the scaling for multiple angles is classified based on the relation among the scaling factors of the rotations. More details of the design process can be found from [11].

In pipelined FFT architectures, uniform scaling can be applied on sets of rotations, which means that R is the same for all the rotations of each FFT stage. The purpose is to select the best coefficients of rotations. Thus, when fixed or arbitrary scaling factor is applied, the output of FFT is shifted by a certain factor.

5.2.2 Shift-and-Add Circuit Implementation

This section describes the methods to design the constant rotators circuit for FFT, especially for MCR. There are two main methods to design rotation circuits for FFT. One is based on using combinations of rotation coefficients for other rotation with the aid of additional multiplexers. Thus, it is reducing the coefficients of rotations. Other is merging the rotation and sharing the adders among them by using additional multiplexers.

Regarding the first technique, trigonometric identities are used to reduce the number of required coefficients. This techniques is applied on twiddle factors of W 16 and W 32 to reduce the required coefficients from three to two and seven to three, respectively [32]. Thus, the equivalent expression for all the coefficients in twiddle factors of W 16 and W 32 is tabulated in Tables 1 and 2, respectively [25, 32].

Table 1 Trigonometric identities used for W 16 twiddle factors
Table 2 Trigonometric identities used for W 32 twiddle factor

The architecture for W 16 twiddle factor is shown in Fig. 14, where a single input is multiplied with any of the coefficient pairs \(\{ (1,0), (\cos \frac {\pi }{8}, \sin \frac {\pi }{8}), (\cos \frac {\pi }{4},\sin \frac {\pi }{4})\}\).

Fig. 14
figure 14

Architecture for W 16 twiddle factors [25]

Another implementation technique is based on CCSSI. The coefficient selection has been explained in Sect. 5.2.1. The implementation consists of two steps: first, obtain the implementation of a rotation by each SCR and then merge together all the rotations that are carried out by the MCR.

The rotation by an SCR includes the multiplication of the input by C and S, and addition of the products. This corresponds to Eq. (1). The multiplication by C and S is carried out by means of shifts and additions according to the MCM representation of the numbers. Similarly, layout the architecture of all the rotations that are carried by the MCR. Finally, the rotations carried out by the same rotator must be merged together. This is done by adding multiplexers to the inputs of the adders. Figure 15 illustrates an architecture for the computing W 8 twiddle factor and Fig. 15a and b shows the multiplication by 1 and \(\frac {\pi }{4}\) rotation, respectively.

Fig. 15
figure 15

W 8 twiddle factor: (a) multiplication by 1, (b) rotation by \( \frac {\pi }{4}\), and (c) \(\{(1),(\cos \frac {\pi }{4},\sin \frac {\pi }{4})\}\)

The architecture in Fig. 15c is a result of merging together the architectures in Fig. 15a and b. The control signal of the multiplexer controls the multiplication of \(\{ (1,0), (\cos \frac {\pi }{4},\sin \frac {\pi }{4})\) [11]. In order to obtain an efficient realization of the rotator, reconfigurable single [13, 35] and multiple constant multiplication [4, 12] techniques can be used. Alternatively, when the number of coefficients is small, which is true in most of the practical cases, the selection of an efficient implementation can be found manually.

5.3 CORDIC Rotator

COordinate Rotation DIgital Computer (CORDIC) is one of the most popular algorithms for implementing multiplier-less rotations [1, 7, 8, 36]. It realizes rotation by means of a series of shifts and additions, which reduces the amount of hardware. It is also suitable for cases, where multipliers are not available. However, it may affect the accuracy since it is based on an approximation. The CORDIC algorithm decomposes the angle that has to be rotated, θ, into a sum of M predefined angles, α i, according to:

$$\displaystyle \begin{aligned} \theta = \sum_{i=0}^{M-1} \delta_i \alpha_{i} + \epsilon, \end{aligned} $$
(14)

where 𝜖 is the error of the approximation, δ i indicates the direction of the so-called micro-rotation and

$$\displaystyle \begin{aligned} \alpha_{i} = \tan^{-1}(2^{-i}). \end{aligned} $$
(15)

These angles that define the micro-rotations have the property that they can be rotated by shifts and additions, which reduces significantly the hardware resources. These micro-rotations are carried out as follows:

$$\displaystyle \begin{aligned} \begin{array}{cc} x_{i+1} = x_{i} - y_{i} \; \delta_{i} \; 2^{-i}; \\ y_{i+1} = y_{i} + x_{i} \; \delta_{i} \; 2^{-i}. \end{array} \end{aligned} $$
(16)

The hardware circuit for calculating the case of δ i = 1 is depicted in Fig. 16; input samples are rotated with an angle α i, which is chosen by setting the number of bits that are shifted before the additions and subtractions are carried out.

Fig. 16
figure 16

CORDIC micro-rotation

Usually δ ∈{−1, 1}. This forces all the micro-rotations to be computed either clockwise or counterclockwise and assures a constant gain for the CORDIC computations, which can be compensated by multiplying the outputs by:

$$\displaystyle \begin{aligned} \begin{array}{l} \displaystyle K = \prod_{i=0}^{M} \cos{}(\alpha_i) = \prod_{i=0}^{M} \cos{}(\tan^{-1}(2^{-i}))= 0.6073. \end{array} {} \end{aligned} $$
(17)

This option is preferable when the circuit is used for rotating several different angles and a constant gain for all of them is required, as happens in the rotators for the FFT. However, in a constant rotator, only a single angle θ can be rotated. In this case, it is better to consider δ i ∈{−1, 0, 1}. This approach is called as redundant CORDIC [19], which allows certain micro-rotations to be removed, reducing the number of adders.

There are multiple variations of the CORDIC algorithm. Some of the main modifications are introduced in the following. Surveys on CORDIC techniques can be found, e.g., in [1, 24]. For some of the approaches, it is not straightforward to determine the rotation parameters at run time. Hence, for these methods the design is carried out offline and the control signals are stored in memory rather than the angles. This approach is naturally possible for all techniques and, as the sequence of angles is often known beforehand, most likely advantageous compared to storing the angle values.

The redundant CORDIC considers that δ i ∈{−1, 0, 1} [34] or even δ i ∈{−2, −1, 0, 1, 2} [20]. This allows several rotation angles at each CORDIC stage. However, the scaling for different angles is different, which implies a need for a specific circuit for scaling compensation. The extended elementary angle set (EEAS) CORDIC [38] and mixed-scaling-rotation (MSR) CORDIC [21, 29] also follow the idea of increasing the number of rotation angles per rotation stage.

The memoryless CORDIC [10] removes the need for rotation memory to store the FFT rotation angles. Instead, the control signals δ i are generated from a counter. This is advantageous for large FFTs, where the butterfly stages have a large number of rotations. The modified vector rotational (MRV) RORDIC [37] allows skipping and repeating CORDIC stages, whereas the hybrid CORDIC [17, 33] divides the rotations into a coarse and a fine rotations. These techniques reduce the number of stages and, therefore, the latency of the CORDIC. The CORDIC II [10] proposes new types of rotation stages: friend angles, uniformly scaled redundant (USR) CORDIC, and nano-rotations. These result in both a low latency and a small number of adders. Finally, the base-3 rotators [18] consider an elementary angle set that is different to that of the CORDIC. All the rotations are generated by combining a small set of FFT angles. This set fits better the rotation angles of the FFT than that of the CORDIC, which results in a reduction in the rotation error, the number of adders, and latency of the circuit.

6 Conclusions

Rotation architecture has an important role in the design of an FFT architecture and has a large effect on the cost of the architecture. This chapter provided an overview over different existing architectures of the rotations especially for FFT. These can be implemented using complex-valued multipliers, constant multipliers, and the CORDIC. Architecture based on the CORDIC and constant multiplication uses shift-and-add circuit, whereas the complex multiplication generally uses complex multiplier and memory to store the coefficients of rotation.