1 Introduction

The Fourier transform is an essential method in most digital signal processing (DSP) applications, including image and signal processing, telecommunications, and optics. However, their widespread use comes at the cost of high computational demands. This is particularly true for the discrete Fourier transform (DFT), where the number of computations required grows to \(M^2\) for a transformation size of M. As M increases, the calculations become exponentially complex [1,2,3]. To address this challenge, the fast Fourier transform (FFT) was developed as an established technique in signal processing. It significantly reduces the complexity of DFT calculations from \(M^2\) to \(M\log _{2}{M}\), making it a more efficient approach for large datasets.

Based on existing literature, numerous FFT algorithms, such as the fast Hartley transform (FHT), have been proposed. However, Split-Radix, Radix-4, and Radix-2 algorithms remain the most widely employed [4, 5]. In DSP, the discrete Fourier transform (DFT) transforms a signal, denoted by y(n) in the time domain, to its frequency domain representation, Y(k). Both y(n) and Y(k) can be complex or real numbers. Calculating Y(k) for each frequency value (k) requires M complex multiplications and M complex additions. Since there are M values of Y(k), an M-point DFT necessitates M complex addition and multiplication blocks. Furthermore, the total number of addition and multiplication operations to compute this M-point DFT scales quadratically with M (\(M^2\)). This complexity becomes significant when dealing with large datasets. It is important to note that convolution in the time domain corresponds to multiplication in the frequency domain [5, 6]. Fourier transforms decompose signals into their fundamental sinusoidal components of various frequencies across the entire sequence. In contrast, wavelet transforms break down signals into localized components that have a defined location in both time and frequency. Some works have explored the combination of these techniques, known as fractal-wavelet analysis [7,8,9,10,11,12].

FFT algorithm leverages a key strategy called “divide-and-conquer.” This method breaks down the computation of a large DFT into calculations of smaller DFTs. To understand its efficiency, let us consider a signal y(n) with an even length, i.e., M is divisible by 2. Dividing y(n) into two equal-sized sub-arrays requires performing some multiplication and addition operations on each sub-array to compute their respective DFTs [13, 14]. Moreover, calculating the DFTs of both sub-arrays separately would seem to require twice the number of operations for the original DFT. The power of the FFT algorithm lies in its ability to exploit clever mathematical relationships between the DFTs of the sub-arrays and the DFT of the original signal, y(n). By utilizing these relationships, the FFT significantly reduces the total number of calculations needed compared to the conventional approach. This translates to faster computation times, making FFT a cornerstone for various DSP applications.

The classification process in FFT involves dividing the signal y(n) into smaller sub-arrays and calculating Y(k) from the DFTs of these sub-arrays [15]. This divide-and-conquer strategy has a dramatic reduction in calculations when the length of the sequences is a power of 2. In a Radix-2 FFT, the sequence y(n) with length M is divided into two sub-arrays based on their even or odd indices [13, 16]. One sub-array consists of all even-indexed samples, and the other contains all odd-indexed samples. The formulas used in subsequent sections will leverage this separation. The main contribution and novelty of this work lie in achieving increased processor efficiency and reduced resource usage. This is achieved by proposing a hardware design of FFT architectures that utilizes a new parallel prefix adder. Compared to existing architectures, the designed FFT architecture offers lower power consumption, smaller chip area and faster operation speed [16, 17].

In FFT processors, Radix-2, Radix-4 and Split-Radix algorithms are commonly used to handle complex arithmetic operations. These algorithms involve various arithmetic operations, including addition and multiplication. Addition can be implemented in two ways: serial or parallel. Serial adders, such as full adders (FAs) and ripple carry adders (RCAs), are slow due to their bit-by-bit processing, have higher power consumption, and require more time for implementation. In contrast, parallel adders are significantly faster because they add multiple bits simultaneously. Designing high-speed, low-power parallel prefix (PP) adders is crucial for efficient FFT implementations. PP adders, such as Kogge and Stone (KS) [18], Brent and Kung (BK) [19], Ladner and Fischer (LF) [20], and Han and Carlson (HC) [21], are popular choices due to their efficiency. Notably, parallel prefix adders offer a key advantage over other adder architectures. They can calculate the carry bits for all stages simultaneously, leading to faster operation. While the reference [22] explores using a Brent–Kung speculative adder for high-speed applications, this approach requires a larger area for processing due to its more complex design, potentially increasing hardware complexity.

1.1 Motivations and contributions

Traditional multiplication and addition circuits often lack optimization in terms of area usage, power consumption and processing speed. Developing high-speed FFT processors necessitates more efficient arithmetic units. This research proposes well-designed circuits that use an efficient adder to achieve improved performance. The key innovation lies in implementing Split-Radix, Radix-2, and Radix-4 FFT structures with a novel parallel prefix (PP) adder. This approach offers significant advantages over existing FFT designs using PP adders. It reduces power consumption and hardware footprint (area) while simultaneously enhancing processing speed. Furthermore, the paper presents a methodology for deploying adders on field-programmable gate arrays (FPGAs). This methodology provides a foundational building block for various DSP filters. The performance of the proposed design is evaluated based on key metrics, i.e., power consumption, delay, operating frequency, number of lookup tables (LUTs) utilized, and power-delay product. We first implemented four different conventional parallel prefix adders. To address limitations in existing designs, we then introduced a novel PP adder that achieves increased speed and lower power consumption. This novel PP adder was subsequently used to construct Split-Radix, Radix-2, and Radix-4 FFT processors. The performance of these processors was evaluated based on area, delay, and power consumption.

In this paper, we propose a novel parallel prefix adder for implementing Radix-2, Radix-4, and Split-Radix FFT algorithms. We conducted a performance analysis based on power consumption, speed, operating frequency, and resource utilization for systems using conventional adders and our proposed arithmetic structure. Simulation results demonstrate that the new parallel prefix adder exhibits impressive performance compared to these conventional structures, particularly in terms of lower power consumption and higher speed. This makes it an ideal candidate for DSP applications where high processing speed is essential. By enhancing the performance of adder blocks, we can minimize resource utilization and consequently reduce the number of computational stages required in different FFT structures. All conventional adders used for comparison were implemented using Xilinx software on an Intel core processor. Our proposed adder offers several advantages such as optimized area, high speed and lower power consumption. Notably, it achieves a power-delay product of 0.380 Wns, making it a compelling choice for designing various blocks in signal processing applications.

1.2 Organizations

The remainder of the work is organized as follows. Section 2 describes the implementation of the FFT structures, including both conventional and proposed arithmetic structures. Section 3 presents the findings from the simulations. Finally, Sect. 4 summarizes the key takeaways of the paper.

2 Methodology

In this paper, we explore the potential of various FFT algorithms, including Split-Radix, Radix-4, and Radix-2. While the Radix-2 algorithm is commonly used in FFT processors to handle complex number arithmetic operations, i.e., addition and multiplication, all three algorithms can be implemented with efficient adders. Traditional multiplication and addition circuits often lack optimization in terms of area usage, power consumption and processing speed. However, high-speed FFT processors necessitate more efficient arithmetic units. This research addresses this challenge by introducing well-designed circuits that use a novel parallel prefix adder to achieve improved performance. The key innovation lies in implementing Split-Radix, Radix-2 and Radix-4 FFT structures with this novel adder. This approach offers significant advantages over existing FFT designs using PP adders. It simultaneously reduces power consumption, minimizes hardware footprint (area), and enhances processing speed.

2.1 Radix-2 butterfly algorithm

In DFT, when M is a power of 2, we can separate the even and odd points in the DFT formula. As a result, the common DFT equation converts to the following formulas,

$$\begin{aligned} {Y_{2k}}&= \sum \limits _{n = 0}^{\frac{M}{2} - 1} {\left( {{y_n} + {y_{n + M/2}}} \right) \cdot W_{M/2}^{kn}}, \end{aligned}$$
(1)
$$\begin{aligned} {Y_{2k + 1}}&= \sum \limits _{n = 0}^{\frac{M}{2} - 1} {\left( {{y_n} - {y_{n + M/2}}} \right) .W_{M/2}^n\cdot W_{M/2}^{kn}}, \end{aligned}$$
(2)

In a radix-2 FFT, the input data are typically arranged in a sequential order. However, the output data after the FFT computation exhibits a bit-reversed order. Hardware implementations often require rearranging the bit-reversed output back to a sequential format for further processing. Figure 1 illustrates a 32-point radix-2 FFT structure. It consists of multiple stages, each containing registers and a radix-2 butterfly unit. The number of registers in each stage depends on the size of the input data. Between consecutive stages, the outputs are multiplied by complex twiddle factors [16, 23].

Fig. 1
figure 1

Different parts of 32-point FFT structure

The Radix-2 butterfly structure significantly reduces the number of multipliers and adders required for the FFT computation. For instance, without the Radix-2 FFT, the first stage necessitates 16 multipliers and 32 adders. However, using the Radix-2 FFT with the butterfly structure, only 1 multiplier and 2 adders are needed in each stage. It is important to note that the number of registers remains the same. In this work, the Radix-2 FFT implementation uses the decimation-in-time (DIT) algorithm. DIT algorithms process data in the time domain first, with the input being bit-reversed and the output in natural order. In contrast, decimation-in-frequency (DIF) algorithms process data in the frequency domain first, having the input in natural order and the output bit-reversed. The Radix-2 FFT structure is a core component of the FFT, enabling efficient computation of complex number operations within the FFT algorithm.

The synthesis output of the FFT simulation suggests that employing Radix-2 structure for a 32-point FFT would be efficient in terms of speed and area [13, 14]. The flow diagram of the Radix-2 FFT structure is depicted in Fig. 2. The multiplier block is succeeded by a PP adder block. The final output is determined by the value of \(C_{in}\). If \(C_{in}=1\), then the equal real and imaginary parts are computed. Otherwise, unequal real and imaginary parts are calculated.

Fig. 2
figure 2

Flow diagram of Radix-2 FFT structure

2.2 Radix-4 butterfly algorithm

In DFT, when M is a power of 4, we can convert \(Y_{2k}\) to \(Y_{4k}\) and \(Y_{4k+2}\) as well as \(Y_{2k+1}\) to \(Y_{4k+1}\) and \(Y_{4k+3}\) so that the following formulas are obtained [14]:

$$\begin{aligned}&Y_{4k}=\sum _{n=0}^{\frac{M}{4}-1}{\left( Y_n+Y_{n+M/4}+Y_{n+M/2}+Y_{n+3M/4}\right) \cdot W_M^0\cdot W_{M/4}^{kn}}, \end{aligned}$$
(3)
$$\begin{aligned}&Y_{4k+1}=\sum _{n=0}^{\frac{M}{4}-1}{\left( Y_n-Y_{n+\frac{M}{4}}-Y_{n+M/2}+Y_{n+3M/4}\right) \cdot W_M^n\cdot W_{M/4}^{kn}}, \end{aligned}$$
(4)
$$\begin{aligned}&Y_{4k+2}=\sum _{n=0}^{\frac{M}{4}-1}{\left( Y_n-Y_{n+\frac{M}{4}} +Y_{n+\frac{M}{2}}-Y_{n+\frac{3M}{4}}\right) \cdot W_M^{2n}\cdot W_{\frac{M}{4}}^{kn}}, \end{aligned}$$
(5)
$$\begin{aligned}&Y_{4k+3}=\sum _{n=0}^{\frac{M}{4}-1}{\left( Y_n+Y_{n+\frac{M}{4}}-Y_{n+\frac{M}{2}} -Y_{n+\frac{3M}{4}}\right) \cdot W_M^{3n}\cdot W_{\frac{M}{4}}^{kn}}. \end{aligned}$$
(6)

The graph obtained from this method for Radix-4 DIF-FFT with 16-point is shown in Fig. 3 with the conversion size \(M=16\) [24, 25].

Fig. 3
figure 3

Structure of 16-point Radix-4 FFT by signal flow graph

2.3 Split-Radix butterfly algorithm

This method is a combination of two aforementioned methods so that only in the third formula, we change \(Y_{2k+1}\) to \(Y_{4k+1}\) and \(Y_{4k+3 }\) so that the following formulas are obtained. The structure of Split-Radix 3/6 FFT by block diagram is shown in Fig. 4.

$$\begin{aligned}&Y_{2k+1}=\sum _{n=0}^{\frac{M}{2}-1}{\left( Y_n-Y_{n+M/2}\right) \cdot W_{M/2}^n\cdot W_{M/2}^{kn}}, \end{aligned}$$
(7)
$$\begin{aligned}&Y_{4k+1}=\sum _{n=0}^{\frac{M}{4}-1}{\left( Y_n-jY_{n+\frac{M}{4}} -Y_{n+M/2}+jY_{n+3M/4}\right) \cdot W_M^n\cdot W_{M/4}^{kn}}, \end{aligned}$$
(8)
$$\begin{aligned}&Y_{4k+3}=\sum _{n=0}^{\frac{M}{4}-1}{\left( Y_n+jY_{n+M/4} -Y_{n+\frac{M}{2}}-jY_{n+3M/4}\right) \cdot W_M^{3n}\cdot W_{M/4}^{kn}}. \end{aligned}$$
(9)

Split-Radix 3/6 FFT approach is defined based on the following equations:

$$\begin{aligned}&Y\left( k\right) =A_k+W_2^krW_3^kmB_k+C_k+W_3^{-k}mE_k+W_2^{-k}rW_3^{-k}mF_k, \end{aligned}$$
(10)
$$\begin{aligned}&A_k=\sum _{n=0}^{\frac{M}{3}-1}x_{3n}W_{\frac{M}{3}}^{nk}, \end{aligned}$$
(11)
$$\begin{aligned}&B_k=\sum _{n=0}^{\frac{M}{6}-1}{x_{6n}+2^r+3^m}W_{\frac{M}{6}}^{nk}, \end{aligned}$$
(12)
$$\begin{aligned}&C_k=\sum _{n=0}^{\frac{M}{6}-1}{x_{6n}+2^r}W_{\frac{M}{6}}^{nk}, \end{aligned}$$
(13)
$$\begin{aligned}&E_k=\sum _{n=0}^{\frac{M}{6}-1}{x_{6n}-2^r}W_{\frac{M}{6}}^{nk}, \end{aligned}$$
(14)
$$\begin{aligned}&F_k=\sum _{n=0}^{\frac{M}{6}-1}{x_{6n}-2^r-3^m}W_{\frac{M}{6}}^{nk}. \end{aligned}$$
(15)
Fig. 4
figure 4

Structure of Split-Radix FFT by block diagram

The structure of Split-Radix FFT by block diagram is given in Fig. 4. If the range of n is considered from 0 to \(\frac{N}{6}\) in the aforementioned equations, Y(k) will be completely obtained. The design of the suggested Split-Radix FFT is summarized. The primary input array with a length equal to M is divided into five arrays. This process for each of new sub-series is successively repeated until the size of all sub-DFT’s would not be divisible by 6 [26, 27].

We consider the performance of the suggested structure by investigating the computational complexity. Moreover, it will be compared with the other structures. We suppose that AN and MN are the number of additions and multiplications, respectively. Moreover, a 3-point DFT needs 12 real additions and 4 real multiplications [28, 29].

2.4 Conventional PP adders

PP adders significantly outperform serial adders by performing addition on multiple bits simultaneously. They offer a good theoretical foundation for balancing power consumption, speed, and occupied area in circuit design, making them ideal for designing high-performance integrated circuits due to their balance between speed, power, and area consumption (VLSI synthesis). The flow diagram of a 16-bit PP adder often includes three stages, i.e., pre-computation, prefix network, and post-computation. Each stage performs specific operations to calculate generate, propagate, carry and sum signals. There are various types of PP adders, including Kogge–Stone (KS), Brent–Kung (BK), Ladner–Fischer (LF), and Han–Carlson (HC) adders. One of the most widely used and significant adders is the Kogge–Stone parallel prefix adder (KSPP), illustrated in Fig. 5. The KSPP comprises 5 stages. In stage 1, the pre-processing stage calculates the generate (\(g_i\)) and propagate (\(p_i\)) values. Stages 2 to 4 form a prefix network where the final carry (\(c_i\)) is determined based on the \(g_i\) and \(p_i\) values from the previous stage, and the sum (\(s_i\)) is calculated in the final stage. Moreover, BKPP, LFPP and HCPP adders are designed as shown in Figs. 6, 7 and 8, respectively.

Fig. 5
figure 5

Different stages of KSPP adder

Fig. 6
figure 6

Different stages of BKPP adder

Fig. 7
figure 7

Different stages of LFPP adder

Fig. 8
figure 8

Different stages of HCPP adder

As observed from Fig. 6, the BKPP adder comprises 8 stages. The first stage serves as a pre-processing stage, calculating generate (\(g_i\)) and propagate (\(p_i\)) values for each bit position. Stages 2 to 7 form a prefix network that utilizes these values to determine the final carry for each bit. This approach is similar to the Ladner–Fischer Parallel Prefix (LFPP) adder in Fig. 7. In the LFPP adder, stage 1 also performs the pre-processing to calculate \(g_i\) and \(p_i\) values. Stages 2–4 form the prefix network stage, and the final stage computes the sum for each bit position.

The Han–Carlson parallel prefix (HCPP) adder in Fig. 8 is a hybrid design that combines elements from both the BKPP and KSPP adders. Similar to the previous adders, stage 1 performs the pre-processing for \(g_i\) and \(p_i\) values. Stages 2–5 form the prefix network, and the last stage calculates the final sum. Interestingly, the HCPP structure shares stages 1 and 6 with the BKPP adder; while, stages 2–5 have the same structure as the KSPP adder. This hybrid approach offers potential benefits in terms of lower power consumption and smaller area footprint on the chip due to reduced wiring connections and gates, but it introduces an additional stage compared to the BKPP and LFPP adders.

2.5 Implementation of Radix-2, Radix-4 and Split-Radix using proposed arithmetic structures

In this paper, we introduce a novel adder design that, when combined with the previously mentioned FFT structures, significantly reduces the number of multipliers and adders required for FFT computations. For instance, without the Radix-2 FFT, the first stage in Fig. 1 would necessitate 16 multipliers and 32 adders. However, with the implementation of the radix-2 FFT, only 1 multiplier and 2 adders are needed in each stage. It is important to note that the number of registers remains unchanged. This reduction in multipliers and adders translates to significant area savings. Additionally, by employing parallel prefix adders, we achieve improved performance compared to traditional serial adders. Parallel prefix adders excel due to their ability to perform addition on multiple bits simultaneously, unlike serial adders which process bits one after another.

The proposed architecture depicted in Fig. 9 offers reduced complexity compared to conventional PP adders. Similar to other parallel prefix adders, Stage 1 calculates the generate and propagate values. Stages 2–4 utilize these values to compute the final carry, and the sum is determined in the last stage. The new parallel prefix adder exhibits improved performance and efficiency compared to previously reported adders. This makes it particularly attractive for applications where speed is a critical factor. The proposed adder has been simulated, and the detailed results are presented in the next section.

Fig. 9
figure 9

Our proposed adder

3 Simulation results

The simulations for this work were conducted using Xilinx Vivado and ISE. The M-point input array was converted to its 16-bit binary equivalents by applying twiddle factors and performing computations within the input block. Simulations were performed for Radix-2, Radix-4 and Split-Radix FFT algorithms. The new arithmetic structure demonstrates significant advantages, particularly in terms of power consumption, number of LUTs, operating frequency and delay compared to other structures. This makes it highly suitable for DSP applications where high processing speed is essential. By enhancing the performance of adder blocks, we can minimize resource utilization and consequently reduce the number of computational stages required in FFT implementations. Table 1 summarizes the performance of various adders using metrics such as area, speed, power consumption and power-delay product. As shown in Table 1, the HCPP adder exhibits a lower delay (12.147 ns) compared to KSPP and LFPP adders. This delay is further broken down into logic delay (5.417 ns) and router delay (6.73 ns). Additionally, the HCPP adder utilizes only 24 LUTs, demonstrating a more compact design compared to the KSPP (38 LUTs), BKPP (33 LUTs), and LFPP (26 LUTs) adders. While the BKPP adder has a lower delay than other adders, it occupies a large area based on the number of LUTs. Our proposed adder offers a compelling combination of optimized area, high speed, and lower power consumption (with a power-delay product of 0.380 Wns), making it a preferred choice for designing various blocks in signal processing applications.

Table 1 Comparison of our proposed adder with the conventional adders

We present the results of the synthesis, focusing on the number of utilized LUTs. Tables 2, 3, and 4 present the device utilization summary for different FFT structures when our proposed adder is used. These tables show the utilization of slice LUTs for Radix-2, Radix-4, and Split-Radix structures, respectively. As observed in the tables, the Split-Radix structure exhibits the lowest LUT utilization at \(4\%\), followed by Radix-4 at \(71\%\), and Radix-2 at \(84\%\). Tables 5, 6, and 7 present the number of calculation operations (multiplications and additions) required for different FFT structures with sizes \(M=64\), \(M=256\) and \(M=1024\), respectively. As expected, the number of multiplications and additions increases for all structures as the data size (M) increases. However, the tables also reveal that the Split-Radix structure consistently requires fewer multiplications and additions compared to Radix-2 and Radix-4 structures for all data sizes evaluated.

Table 2 Device utilization summary for 16 point FFT using Radix-2 Algorithm
Table 3 Device utilization summary for 16 point FFT using Radix-4 Algorithm
Table 4 Device utilization summary for FFT using Split-Radix algorithm
Table 5 The number of sums and multiplications for \(M = 64\)
Table 6 Number of sums and multiplications for \(M = 256\)
Table 7 Number of sums and multiplications for \(M = 1024\)

We have compared the processing power, the number of LUTs, working frequency and delay for different algorithms in Figs. 10, 11, 12 and 13 for chips using conventional adders and our proposed adder. According to the values obtained from the simulations in Fig. 10, the power consumption of the Split-Radix algorithm is less than the other methods which is 17 mW and 22 mW for the circuit using proposed adder and conventional adders, respectively which this means that the system is more optimized. Moreover, in the case of using our proposed adder for all FFT algorithms, the consumed power is less than those of conventional adders. As expected, Split-Radix FFT deemed more suitable for FPGA implementation than the Radix-4 and Radix-2 algorithms and it is well-suited for a low-power FFT processor because it requires the fewest arithmetic operations among all FFT algorithms.

Fig. 10
figure 10

The power consumption (mW) for different adders and algorithms

Fig. 11
figure 11

Number of LUTs for different adders and algorithms

Fig. 12
figure 12

Operating frequency (Hz) of the system for different adders and algorithms

Fig. 13
figure 13

Delay (ns) of the system for different adders and algorithms

Figure 11 demonstrates that using our proposed adder leads to lower resource utilization compared to conventional adders. For instance, the Split-Radix algorithm with the proposed adder utilizes fewer resources than the Radix-4 algorithm with a conventional adder. The chips are made up of many LUTs, and the less the number of LUTs used, the more resources are available for the system to perform other calculations. As a result, it increases the speed of the system which can be seen in the Split-Radix method [27]. Operating frequency in Fig. 12 refers to the number of cycles a processor can execute per second. A higher operating frequency generally translates to faster execution of instructions. Based on the results, the Split-Radix algorithm achieves a higher operating frequency, indicating faster operation.

Figure 13 shows the delay experienced by different FFT algorithms implemented with both the conventional adder and the proposed adder. The results indicate that the proposed adder offers lower delay compared to conventional adders for all algorithms. Additionally, the Split-Radix algorithm exhibits the lowest delay among all algorithms based on this metric. Specifically, the delays for Radix-2, Radix-4, and Split-Radix with the proposed adder are 3 ns, 1.2 ns, and 0.5 ns, respectively.

4 Conclusion

In this paper, we have proposed a new arithmetic structure. The aforementioned structure demonstrates impressive performance compared to other structures. It is particularly suitable for use in DSP applications where high processing speed is crucial. By enhancing the performance of adder blocks, the resource utilization and number of computational stages in the implementation of FFT algorithms can be minimized. Our proposed adder offers optimized area, high speed and lower power consumption, making it a preferred choice for designing various blocks in signal processing applications. According to the results of device utilization and computational complexity, Radix-4 and Split-Radix methods are better than Radix-2 method. In the simulation results section, the power consumption, the used chip surface, the working frequency, and delay in the aforementioned methods are compared. From comparing the results, we can see that Radix-4 and Split-Radix are better than Radix-2 algorithm and they work more efficiently. Finally, according to the changes applied in the Split-Radix algorithm, it has a very high efficiency, which is suitable for complex applications.