Abstract
Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPU, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves comparable performance as the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two size. For 3D non-power-of-two FFTs, our library delivers 1.5x to 28x faster than FFTW with 4 threads and 20.01x average speedup over CUFFT 4.0 on Tesla C2050.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Duhamel P, Vetterli M. Fast fourier transforms: A tutorial review and a state of the art. Signal Processing, 1990, 9(14): 259-299.
Govindaraju N K, Lloyd B, Dotsenko Y, Smith B, Manferdelli J. High performance discrete Fourier transforms on graphics processors. In Proc. SC, Nov. 2008, Article No.2.
Nukada A, Matsuoka S. Auto-tuning 3-D FFT library for CUDA GPUs. In Proc. SC, Nov. 2009, Article No.30.
Dotsenko Y, Baghsorkhi S S, Lloyd B, Govindaraju N K. Auto-tuning of fast Fourier transform on graphics processors. In Proc PPoPP, Feb. 2011, pp.257-266.
Gu L, Li X M, Siegel J. An empirically tuned 2D and 3D FFT library on CUDA GPU. In Proc. the 24th ICS, June 2010, pp.305-314.
Gaster B, Howes L, Kaeli D R, Mistry P, Schaa D. Heterogeneous Computing with OpenCL. San Fransisco, USA: Morgan Kaufmann, 2011.
Munshi A, Gaster B, Mattson T G, Fung J, Ginsburg D. OpenCL Programming Guide. Boston, USA: Addison-Wesley Professional, 2011.
Zhang E Z, Jiang Y L, Guo Z Y, Shen X P. Streamlining GPU applications on the fly: Thread divergence elimination through runtime thread-data remapping. In Proc. the 24th ICS, June 2010, pp.115-126.
Yang Y, Xiang P, Kong J F, Zhou H Y. A GPGPU compiler for memory optimization and parallelism management. In Proc. PLDI, June 2010, pp.86-97.
Cooley J W, Tukey J W. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 1965, 19: 297-301.
Van Loan C. Computational Frameworks for the Fast Fourier Transform. Philadelphia, USA: SIAM, 1992.
Johnson J, Johnson R W, Rodriguez D, Tolimieri R. A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures. Circuits, Systems and Signal Processing, 1990, 9(4): 449-500.
Franchetti F, PÄuschel M, Voronenko Y, Chellappa S, Moura J M F. Discrete Fourier transform on multicore. IEEE Signal Processing Magazine, 2009, 26(6): 90-102.
Tolimieri R, An M, Lu C. Algorithms for Discrete Fourier Transforms and Convolution. Berlin: Springer-Verlag, 1989.
NVIDIA Corporation. NVIDIA CUDA compute unified device architecture − Programming guide version 4.2. 2008, http://developer.nvidia.com/cuda/nvidia-gpu-computing-documentation, Sept. 2012.
Ji F, Ma X S. Using shared memory to accelerate MapReduce on graphics processing units. In Proc. IPDPS, May 2011, pp.805-816.
Jiao Y, Lin H, Balaji P, Feng W. Power and performance characterization of computational kernels on the GPU. In Proc. GreenCom-CPSCOM, Dec. 2010, pp.221-228.
Baghsorkhi S S, Delahaye M, Patel S J, Gropp W D, Hwu W W. An adaptive performance modeling tool for GPU architectures. In Proc. the 15th PPoPP, May 2010, pp.105-114.
Schwarztrauber P N. Multiprocessor FFTs. Parallel Computing, 1987, 5: 197-210.
Frigo M, Johnson S G. The design and implementation of FFTW3. In Proceedings of the IEEE, 2005, 93(2): 216-231.
Frigo M. A fast Fourier transform compiler. In Proc. PLDI, May 1999, pp.169-180.
Mesmay F, Franchetti F, Voronenko Y. Encyclopedia of Parallel Computing. Berlin: Springer, 2011.
de Mesmay F, Voronenko Y, PÄuschel M. Offline library adaptation using automatically generated heuristics. In Proc. IPDPS, Apr. 2010, pp.1-10.
Kirk D B, Hwu W W. Programming Massively Parallel Processors: A Hands-on Approach. San Fransisco, USA: Morgan Kaufmann, 2010.
Purnomo B, Rubin N, Houston M. ATI stream profiler: A tool to optimize an OpenCL kernel on ATI radeon GPUs. In Proc. SIGGRAPH, July 2011, pp.26-30.
Mirkovic D, Johnsson S L. Automatic performance tuning in the UHFFT library. In Proc. ICCS, May 2001, pp.71-80.
Ali A, Johnsson L, Mirkovic D. Empirical auto-tuning code generator for FFT and trigonometric transforms. In Proc. ODES, Mar. 2007.
Mirkovic D, Mahasoom R, Johnsson L. An adaptive software library for fast Fourier transforms. In Proc. the 14th ICS, May 2000, pp.215-224.
Author information
Authors and Affiliations
Corresponding author
Additional information
This work is supported in partial by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61100073, 61100066, the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010902, 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Li, Y., Zhang, YQ., Liu, YQ. et al. MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs. J. Comput. Sci. Technol. 28, 90–105 (2013). https://doi.org/10.1007/s11390-013-1314-8
Received:
Revised:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11390-013-1314-8