
1 Introduction

The FFT has been studied far and wide. Every year, new results about its implementation appear, boosting the speed of some of the most famous versions, such as FFTW [1]. Recently, the NEC SX-Aurora Vector Engine has been used to test the behaviour of some FFT implementations on large vector registers (256 doubles, i.e. 16384 bits per register) [2]. This is an important result for our work, since it pushes towards the usage of SIMD/vectorized architectures. Nevertheless, outside the world of High Performance Computing (HPC), the most widely available SIMD technology is the AVX-512 extension (see Sect. 3), which is spreading among x86 CPUs, both Intel and AMD. Hence, in this work we focus on AVX-512, with the following contributions:

  • we experiment with a different and uncommon way to store complex numbers;

  • we present a manual vectorization of the FFT, kept simple and readable so that it can be used for teaching purposes;

  • we measure the performance of the different versions, pointing out the advantages of using AVX-512.

2 Different Data Layouts: An Overview

We can store complex numbers as pairs of floating-point numbers, choosing the precision (single/double) best suited to the computation at hand. Given such a complex structure, instantiating an array yields a sequence of numbers in which real and imaginary parts alternate. We call this layout complex interleaved (Fig. 1a).

An alternative is to store the two components separately, in two arrays of floating-point numbers. This is called the block interleaved layout (Fig. 1b): it is not common at all, since existing software and standards for C/C++ only support the interleaved data format [3], but it can be useful when dealing with vector registers and when gathering data from memory.
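For concreteness, the two layouts can be sketched in C++ as follows (a minimal illustration of ours; type and field names are not taken from any existing library):

```cpp
#include <vector>

// Complex interleaved (Fig. 1a): real and imaginary parts alternate in a single array.
struct ComplexCI {
    double re;
    double im;
};
using SignalCI = std::vector<ComplexCI>;   // memory: re0, im0, re1, im1, ...

// Block interleaved (Fig. 1b): the two components live in two separate arrays.
struct SignalBI {
    std::vector<double> re;                // memory: re0, re1, re2, ...
    std::vector<double> im;                // memory: im0, im1, im2, ...
};
```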

Exploiting a mixture of these layouts to boost the performance of algorithms on SIMD architectures has already been studied [3], achieving up to 2x performance improvements over state-of-the-art library implementations. Our work studies and compares both layouts.

3 The AVX-512 Instruction Set

Computers have featured some kind of SIMD extension for a long time now. SSE and AVX2 are milestones of this process, though their usage was limited by the width of their registers, 128 and 256 bits respectively. AVX-512 was announced in 2013 as an evolution of the latter, introducing new instructions and providing 512-bit registers.

Fig. 1. Different storage layouts for an array of complex numbers

Despite the speedup these instructions can provide, AVX-512 is not always appropriate: it does not make I/O-bound programs faster, nor programs with complex conditional behaviour, since they offer no parallel operations to execute. The tasks that do benefit, thanks to their inherent data parallelism, include AI, cryptography and mathematical computations. Programmers should understand where and when this extension is useful, in order to gain a speedup that is orthogonal to the algorithm itself.

Browsing Intel's intrinsics guide (see Footnote 1) is a good starting point to become familiar with the nomenclature and the available instructions.

3.1 Loop Unrolling

One of the main uses of SIMD instructions is loop unrolling, which allows one to avoid loops or to reduce the number of iterations. For instance, suppose one needs to compute the Hadamard (element-wise) product of two arrays of 8 doubles each. Exploiting AVX-512, one can proceed in this way:

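The original listing is not reproduced here; what follows is a minimal sketch of the idea, using standard AVX-512 intrinsics from immintrin.h (the function name hadamard8 is ours):

```cpp
#include <immintrin.h>

// Hadamard (element-wise) product of two arrays of 8 doubles, using one 512-bit register.
void hadamard8(const double *a, const double *b, double *c) {
    __m512d va = _mm512_loadu_pd(a);        // load 8 doubles from a
    __m512d vb = _mm512_loadu_pd(b);        // load 8 doubles from b
    __m512d vc = _mm512_mul_pd(va, vb);     // multiply the two registers lane by lane
    _mm512_storeu_pd(c, vc);                // store the 8 products into c
}
```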

If the length of the arrays is greater than 8 (but, for simplicity, still a multiple of 8), one can iterate this snippet N/8 times, as in the sketch below.
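A sketch of that iteration, assuming the length n is indeed a multiple of 8:

```cpp
#include <immintrin.h>
#include <cstddef>

// Element-wise product of arrays of length n (assumed to be a multiple of 8),
// processed in chunks of 8 doubles per iteration.
void hadamard(const double *a, const double *b, double *c, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m512d va = _mm512_loadu_pd(a + i);
        __m512d vb = _mm512_loadu_pd(b + i);
        _mm512_storeu_pd(c + i, _mm512_mul_pd(va, vb));
    }
}
```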

3.2 Superword Level Parallelism

SLP is another technique widely adopted by programmers and compilers to perform vectorization. It consists of gathering instructions that are similar but not directly dependent on each other, and executing them together using SIMD registers. An example of this technique is shown in Sect. 5.2.

4 The Recursive Version of the FFT Algorithm

The radix-2 algorithm is the simplest way to decrease the complexity of the DFT (Discrete Fourier Transform) from \(O(N^2)\) to \(O(N \log N)\), though nowadays FFT libraries are thousands of lines of code long (they perform different operations depending on the kind of input and on the available hardware). Furthermore, an iterative version of the algorithm can be optimized far more than a recursive one, which pays the overhead of \(O(N)\) recursive calls on the stack.

Despite this, our work only aims to experiment with manual vectorization of the code; for this reason, we looked for an algorithm that is both interesting and useful in real-world applications: the recursive FFT is simple and follows the mathematical expression provided by Cooley and Tukey in 1965 [4]. It was easier to approach, yet sufficient to test the AVX-512 capabilities and the two storage layouts described above. As in most real-world use cases of the FFT, we only consider inputs whose length is a power of 2.
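For reference, a minimal scalar sketch of this recursive formulation is shown below (our own illustration, not the repository code, which additionally caches twiddle factors as described in Sect. 4.1 and vectorizes the combination step):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 Cooley-Tukey FFT; the input length is assumed to be a power of 2.
std::vector<cd> fft(const std::vector<cd> &x) {
    const std::size_t N = x.size();
    if (N == 1) return x;                           // the DFT of a single number is the number itself
    std::vector<cd> even(N / 2), odd(N / 2);
    for (std::size_t i = 0; i < N / 2; ++i) {       // split into even- and odd-indexed terms
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    std::vector<cd> E = fft(even), O = fft(odd), X(N);
    const double PI = std::acos(-1.0);
    for (std::size_t k = 0; k < N / 2; ++k) {       // combine the two halves
        cd w = std::polar(1.0, -2.0 * PI * k / N);  // root of unity e^{-j 2 pi k / N}
        X[k]         = E[k] + w * O[k];
        X[k + N / 2] = E[k] - w * O[k];
    }
    return X;
}
```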

4.1 Twiddle Factor: How Memoization Can Help

The first thing we observed about the trivial version of the FFT we used was the enormous number of trigonometric function calls needed to compute the roots of unity. For each call of the function on an input of length N, one has to use all the roots of order N, expressed as

$$\begin{aligned} e^{-j\frac{2\pi k}{N}} = \cos (\frac{2\pi k}{N}) - j \sin (\frac{2\pi k}{N}) \quad k \in \{0,1,...,N-1\} \end{aligned}$$

Trigonometric functions are tremendously time-consuming for CPUs. We ran a simple benchmark (see Footnote 2) of the FFT on an input of length \(2^{13}\), both computing these roots and skipping their computation: the latter was two times faster than the former.

An immediate observation is that, once the roots of order N have been computed, subsequent calls of the function on an input of the same length can reuse them. These precomputed roots are well known in the literature as twiddle factors [5].

The needed roots can either be calculated before executing the algorithm or computed on the fly and saved for later use. We chose the latter: we introduced a look-up table in which the roots of order N are saved, so that they can be accessed by later calls. The idea recalls memoization in dynamic programming.
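A minimal sketch of such a look-up table (the data structure and the function name are our own; the repository implementation may differ):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Lazily compute and cache the roots e^{-j 2 pi k / N}, k = 0..N-1, the first time
// an order N is requested; later calls with the same N reuse the stored values.
const std::vector<std::complex<double>> &twiddles(std::size_t N) {
    static std::unordered_map<std::size_t, std::vector<std::complex<double>>> table;
    auto it = table.find(N);
    if (it == table.end()) {
        const double PI = std::acos(-1.0);
        std::vector<std::complex<double>> w(N);
        for (std::size_t k = 0; k < N; ++k)
            w[k] = std::polar(1.0, -2.0 * PI * k / N);
        it = table.emplace(N, std::move(w)).first;
    }
    return it->second;
}
```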

5 The Vectorized Version of the Recursive FFT

We implemented two different versions of the AVX-512 FFT, one with the complex-interleaved layout (called CI_AVX) and another with the block-interleaved layout (called BI_AVX). The C++ source code of both versions has been made publicly available and can be downloaded from this link: https://github.com/pcineverdies/FFT-AVX-512.

5.1 Link to Hadamard Product

The key to vectorizing the function is to exploit the radix-2 expression of the FFT: given an input X of size \(N = 2^M\), its DFT is equal to

$$\begin{aligned} DFT(X)_k = DFT(X_e)_k + e^{-j\frac{2\pi k}{N}} \cdot DFT(X_o)_k \quad k \in \{0,1,...,N-1\} \end{aligned}$$

where \(DFT(X_e)\) is the DFT of the even-indexed terms of X and \(DFT(X_o)\) is the DFT of its odd-indexed terms; both have length N/2, so their index k is taken modulo N/2. Once these two halves are computed recursively, the result is obtained by the element-wise product between the vector of roots and \(DFT(X_o)\), added to \(DFT(X_e)\). This is an interesting observation, since the element-wise product can be easily vectorized with both complex-number layouts. An intuitive view of the process is shown in Fig. 2 (inspired by a figure in [3]).

Fig. 2. Element-wise product using the two storage layouts
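As a concrete illustration of the block-interleaved side of Fig. 2, here is a minimal sketch of the element-wise complex product with separate real and imaginary arrays (our own code, not taken from the repository; the length n is assumed to be a multiple of 8):

```cpp
#include <immintrin.h>
#include <cstddef>

// Element-wise complex product c = a * b in the BI layout: 8 complex products per iteration,
// with no permutations needed, since real and imaginary parts are already separated.
void complex_mul_bi(const double *ar, const double *ai,
                    const double *br, const double *bi,
                    double *cr, double *ci, std::size_t n) {
    for (std::size_t k = 0; k < n; k += 8) {
        __m512d xr = _mm512_loadu_pd(ar + k), xi = _mm512_loadu_pd(ai + k);
        __m512d yr = _mm512_loadu_pd(br + k), yi = _mm512_loadu_pd(bi + k);
        // (xr + j*xi)(yr + j*yi) = (xr*yr - xi*yi) + j*(xr*yi + xi*yr)
        _mm512_storeu_pd(cr + k, _mm512_fmsub_pd(xr, yr, _mm512_mul_pd(xi, yi)));
        _mm512_storeu_pd(ci + k, _mm512_fmadd_pd(xr, yi, _mm512_mul_pd(xi, yr)));
    }
}
```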

5.2 Base Cases of Recursion

As a consequence of the DFT's expression, when the input is a sequence of 4 or 2 elements, everything reduces to additions and subtractions between real and imaginary parts of the input. This is why we do not need to recurse all the way down to sequences of length 1 (the DFT of a single number is the number itself) and can stop at length 4. We also handle inputs of length 2 and 1, which are stand-alone cases.

In these base cases, we applied SLP, as shown in the snippet below (DFT of an input of length 4 stored in the complex array wave, with the CI layout): the first block of instructions gathers the data in the correct order, while the second one computes the additions/subtractions that give the final result.

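The repository listing is not reproduced here; the following is our own reconstruction of the same idea, with the four complex inputs stored interleaved in wave. The first block permutes the data into position; the second performs the blended additions and subtractions (multiplication by ±j reduces to swapping real and imaginary parts).

```cpp
#include <immintrin.h>

// In-place length-4 DFT of a CI array wave = [r0,i0,r1,i1,r2,i2,r3,i3] (one 512-bit register).
void dft4_ci(double *wave) {
    __m512d x = _mm512_loadu_pd(wave);

    // Block 1: gather (x0,x1,x0,x1) and (x2,x3,x2,x3), then form
    // s = [x0+x2, x1+x3, x0-x2, x1-x3] (as complex pairs) with a blended add/sub.
    __m512i lo_idx = _mm512_setr_epi64(0, 1, 2, 3, 0, 1, 2, 3);
    __m512i hi_idx = _mm512_setr_epi64(4, 5, 6, 7, 4, 5, 6, 7);
    __m512d lo = _mm512_permutexvar_pd(lo_idx, x);
    __m512d hi = _mm512_permutexvar_pd(hi_idx, x);
    __m512d s  = _mm512_mask_blend_pd(0xF0, _mm512_add_pd(lo, hi), _mm512_sub_pd(lo, hi));

    // Block 2: rearrange the partial sums (multiplying by -j/+j swaps re and im),
    // then add/subtract lane by lane: lanes 3..6 subtract, the others add.
    __m512i a_idx = _mm512_setr_epi64(0, 1, 4, 5, 0, 1, 4, 5);
    __m512i b_idx = _mm512_setr_epi64(2, 3, 7, 6, 2, 3, 7, 6);
    __m512d a = _mm512_permutexvar_pd(a_idx, s);
    __m512d b = _mm512_permutexvar_pd(b_idx, s);
    _mm512_storeu_pd(wave, _mm512_mask_blend_pd(0x78, _mm512_add_pd(a, b), _mm512_sub_pd(a, b)));
}
```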

6 Experimental Results

In the next subsection we report the numerical results we obtained, while in the following one we discuss why, in our implementation, the block-interleaved layout does not give any additional speedup.

6.1 Numerical Results

We measured the performance of four versions of the FFT:

  • NO_AVX, a standard version of the FFT, optimized with the twiddle factor, compiled with the -O3 flag but without auto-vectorization;

  • VE_AVX, the same as above, but with auto-vectorization enabled;

  • CI_AVX, the version vectorized by hand with the complex-interleaved layout, compiled with the -O3 flag;

  • BI_AVX, the version vectorized by hand with the block-interleaved layout, compiled with the -O3 flag.

To do so, we measured the time elapsed between the call of the function and its termination: after \(2^{13}\) measurements, we took the median of the data, which is statistically more robust than the arithmetic mean.
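A sketch of this measurement procedure (the helper and its use of std::chrono are our assumptions; the repository may time the calls differently):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Run the given callable `runs` times, time each call, and return the median in microseconds.
template <typename F>
double median_runtime_us(F &&fft_call, std::size_t runs = std::size_t{1} << 13) {
    std::vector<double> samples(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        fft_call();                                   // one timed invocation of the FFT
        auto stop = std::chrono::steady_clock::now();
        samples[i] = std::chrono::duration<double, std::micro>(stop - start).count();
    }
    std::nth_element(samples.begin(), samples.begin() + runs / 2, samples.end());
    return samples[runs / 2];                         // median of the measurements
}
```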

Table 1. Execution time (\(\mu \)s) of FFT for some values of N
Fig. 3. Charts of the measurements

Some of the results are shown in Table 1, while a complete overview for N between \(2^3\) and \(2^{17}\) can be found in Fig. 3. As the numbers show, the vectorized versions are more efficient than the standard one, with the hand-vectorized code being 33.78% (\(\sim 1.51\times\)) faster.

6.2 Why Is Block Interleaved Not Good Enough in Our Setting?

As is immediately clear from the results, the block-interleaved version performs quite similarly to the complex-interleaved one; in some cases it is even slightly slower. This is not what we expected: since this layout is uncommon, it would need to deliver a major speedup to justify its use.

In [3] the authors use a mixed version of the two layouts: they start from a CI input, then exploit the computations of the algorithm itself to obtain a BI arrangement (which makes some operations faster, such as the element-wise product, since it reduces the use of slow instructions such as permutations), and they finally end up with a CI result. Since we instead start directly from a block-interleaved input, we are not able to replicate their speedups when using BI.

7 Conclusions

The final result of our experiments is a 33.78% (\(\sim\)1.51\(\times\)) speedup of the hand-vectorized version over the initial trivial one. Automatic vectorization achieves a much smaller speedup of 10.66% (\(\sim\)1.12\(\times\)). We have also shown that the BI layout, besides being uncommon and not supported by existing C/C++ standards and libraries, does not provide any advantage over CI in our setting.

We would like to stress the importance of the AVX extensions for programs that aim at efficiency and speed. Clearly, hand-writing vectorized code is not the best way to exploit this functionality, since it is error-prone and hard to maintain: the preferable approach is to let the compiler introduce the vectorizations discussed in the previous sections, guiding its choices with pragmas.

If the result is still not satisfactory, the programmer can disassemble the compiled code and hand-tune some instructions to speed up the program; this is why it is important to be familiar with this extension, and it is the main lesson of this study. As future work, we will:

  • extend the code to exploit multi-threading, using the recently introduced C++ standard threading facilities;

  • port the code to CPU clusters [6];

  • investigate how to optimize the impact on cache hierarchies [7].