Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Abdelfattah, Ahmad; Keyes, David; Ltaief, Hatem

doi:10.1007/978-3-642-36949-0_23

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7640)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

GPUs have proven very effective at accelerating dense linear algebra (DLA) computational kernels. Many high-performance numerical libraries, such as CUBLAS, MAGMA, and CULA, provide BLAS and LAPACK implementations on GPUs as well as hybrid computations involving both CPUs and GPUs. GPUs usually outperform CPUs on compute-bound operations, especially those characterized by a regular data access pattern. This paper highlights a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs by taking advantage of the underlying device’s architecture (e.g., high throughput). In recent work (Abdelfattah et al., VECPAR 2012), this methodology outperformed existing state-of-the-art GPU implementations of the symmetric matrix-vector multiplication (SYMV), which is characterized by an irregular data access pattern. We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes. This makes it valuable for computing the Hessenberg and bidiagonal reductions of general matrices (used in radar applications), which are the first steps toward computing eigenvalues and singular values, respectively. For small and medium matrices (dimension ≤4500), our GEMV kernel achieves an average improvement of 60% in single precision (SP) and 25% in double precision (DP) over existing open-source and commercial software solutions. These results improve reduction algorithms for both small and large matrices. The improved GEMV performance yields an average improvement of 30% (SP) and 15% (DP) for the Hessenberg reduction, and up to 25% (SP) and 14% (DP) for the bidiagonal reduction, over the implementation provided by CUBLAS 5.0.
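To make the term "memory-bound" concrete, the following minimal sketch estimates the arithmetic intensity (flops per byte of memory traffic) of GEMV, y = alpha*A*x + beta*y. It is an illustration only and is not taken from the paper: the function names are hypothetical, the traffic model assumes each element of A, x, and y moves through memory exactly once, the O(m) flops for scaling by alpha and beta are ignored, and the bandwidth figure is an assumed placeholder rather than a measured value for any device.

```python
def gemv_traffic(m, n, bytes_per_element):
    """Return (flops, bytes moved) for y = alpha*A*x + beta*y, A of size m-by-n.

    Assumption: every element is moved exactly once (ideal caching), and the
    O(m) flops spent scaling by alpha and beta are ignored.
    """
    flops = 2 * m * n  # one multiply and one add per matrix element
    # Read A (m*n elements), read x (n), read y and write y (2*m).
    elements_moved = m * n + n + 2 * m
    return flops, elements_moved * bytes_per_element


def arithmetic_intensity(m, n, bytes_per_element):
    """Flops performed per byte of memory traffic."""
    flops, nbytes = gemv_traffic(m, n, bytes_per_element)
    return flops / nbytes


def bandwidth_bound_gflops(intensity, bandwidth_gb_per_s):
    """Upper bound on GFLOP/s when memory bandwidth is the only limit."""
    return intensity * bandwidth_gb_per_s


if __name__ == "__main__":
    # Hypothetical bandwidth, chosen only to illustrate the calculation.
    ASSUMED_BANDWIDTH_GB_PER_S = 150.0
    for label, size in (("SP", 4), ("DP", 8)):
        ai = arithmetic_intensity(4096, 4096, size)
        bound = bandwidth_bound_gflops(ai, ASSUMED_BANDWIDTH_GB_PER_S)
        print(f"{label}: {ai:.3f} flop/byte, bound {bound:.1f} GFLOP/s")
```

Under this model the intensity stays just below 0.5 flop/byte in single precision and 0.25 flop/byte in double precision regardless of matrix size, so the attainable rate is governed by how efficiently the kernel streams the matrix from memory rather than by the device's peak arithmetic rate.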

References

CULA Dense Free Edition, http://www.culatools.com/
Matrix Algebra on GPU and Multicore Architectures. Innovative Computing Laboratory, University of Tennessee, http://icl.cs.utk.edu/magma/
NVIDIA CUDA Toolkit, http://developer.nvidia.com/cuda-toolkit
NVIDIA Visual Profiler, http://developer.nvidia.com/nvidia-visual-profiler
Performance Application Programming Interface (PAPI). Innovative Computing Laboratory, University of Tennessee, http://icl.cs.utk.edu/papi/
The NVIDIA CUDA Basic Linear Algebra Subroutines (CUBLAS), http://developer.nvidia.com/cublas
Abdelfattah, A., Dongarra, J., Keyes, D., Ltaief, H.: Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators. In: The 10th International Meeting on High Performance Computing for Computational Science, VECPAR 2012 (accepted, 2012)
Humphrey, J.R., Price, D.K., Spagnoli, K.E., Paolini, A.L., Kelmelis, E.J.: CULA: Hybrid GPU Accelerated Linear Algebra Routines. In: Proceedings of SPIE Defense and Security Symposium, DSS (April 2010)
Kurzak, J., Tomov, S., Dongarra, J.: Autotuning GEMM Kernels for the Fermi GPU. IEEE Transactions on Parallel and Distributed Systems PP(99), 1 (2012)
Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J.: Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture - GeForce GTX 680. LAPACK Working Note 267
Kwon, Y., Narayanan, R.M., Rangaswamy, M.: A multi-target detector using mutual information for noise radar systems in low SNR regimes. In: 2010 International Waveform Diversity and Design Conference, WDD, pp. 000105–000109 (August 2010)
Nath, R., Tomov, S., Dong, T., Dongarra, J.: Optimizing symmetric dense matrix-vector multiplication on GPUs. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2011, pp. 6:1–6:10. ACM, New York (2011)
Nath, R., Tomov, S., Dongarra, J.: An Improved Magma Gemm for Fermi Graphics Processing Units. Int. J. High Perform. Comput. Appl. 24(4), 511–515 (2010)
Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC 2008, pp. 31:1–31:11. IEEE Press, Piscataway (2008)
Yu, W.C., Quan, W.D.: On the signal processing in the life-detection radar using an FMCW waveform. In: 2010 Third International Symposium on Information Processing, ISIP, pp. 213–216 (October 2010)

Author information

Authors and Affiliations

Division of Mathematical and Computer Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Ahmad Abdelfattah & David Keyes
Supercomputing Laboratory, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Hatem Ltaief

Editor information

Editors and Affiliations

Computer Technology Institute and Press “Diophantus” & Department of Computer Engineering and Informatics, University of Patras, 26504, Rio, Greece
Ioannis Caragiannis
Technische Universität Wien, Austria
Michael Alexander
Artificial Intelligence Research Institute (IIIA), Spanish National Research Council (CSIC), Spain
Rosa Maria Badia
Department of Medical and Surgical Sciences, Bioinformatics Laboratory, University Magna Græcia of Catanzaro, 88100, Catanzaro, Italy
Mario Cannataro
Inria Rennes, France
Alexandru Costan
Dept. Computer Science, Univ. Pisa, Largo Pontecorvo 3, 56127, Pisa, Italy
Marco Danelutto
Inria, 46 Allée d’Italie, 69364, Lyon Cedex 7, France
Frédéric Desprez
Université de Versailles, France
Bettina Krammer
Department of Computer Engineering (DISCA), Universitat Politècnica de València, Spain
Julio Sahuquillo
Oak Ridge National Laboratory, USA
Stephen L. Scott
Technische Universität München, Germany
Josef Weidendorfer

About this paper

Cite this paper

Abdelfattah, A., Keyes, D., Ltaief, H. (2013). Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU. In: Caragiannis, I., et al. Euro-Par 2012: Parallel Processing Workshops. Euro-Par 2012. Lecture Notes in Computer Science, vol 7640. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36949-0_23

DOI: https://doi.org/10.1007/978-3-642-36949-0_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36948-3
Online ISBN: 978-3-642-36949-0
eBook Packages: Computer Science, Computer Science (R0)
