Improving Performance Portability in OpenCL Programs

Zhang, Yao; Sinclair, Mark; Chien, Andrew A.

doi:10.1007/978-3-642-38750-0_11

Yao Zhang, Mark Sinclair II & Andrew A. Chien

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7905)

Included in the following conference series:

International Supercomputing Conference

Abstract

We study the performance portability of OpenCL across diverse architectures, including an NVIDIA GPU, an Intel Ivy Bridge CPU, and an AMD Fusion APU. We present a detailed assembly-level performance analysis of three exemplar OpenCL benchmarks: SGEMM, SpMV, and FFT. We also identify a number of tuning knobs that are critical to performance portability, including thread-data mapping, data layout, tiling size, data caching, and operation-specific factors. We further demonstrate that proper tuning could improve OpenCL's portable performance on the Ivy Bridge CPU from the current 15% to a potential 67% of state-of-the-art performance. Finally, we evaluate the current OpenCL programming model and propose a list of extensions that would improve performance portability.
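To make the idea of a tuning knob concrete, here is a minimal sketch of tiling size exposed as a parameter of a blocked matrix multiply (the computation behind SGEMM). It is an illustration only, not the authors' OpenCL kernel: it is written in plain Python rather than OpenCL, and the names `sgemm_tiled`, `sgemm_reference`, and `tile` are hypothetical. The point is that the result is the same for every tile size, so the parameter can be chosen per device purely for performance.

```python
def sgemm_tiled(a, b, m, n, k, tile):
    """Compute C = A * B for row-major flat lists.

    a is m x k, b is k x n. `tile` is the (hypothetical) tuning knob:
    it changes the loop blocking, not the result.
    """
    c = [0.0] * (m * n)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # Work on one tile-sized block; min() handles ragged edges
                # when the matrix size is not a multiple of the tile size.
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i * k + p]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i * n + j] += a_ip * b[p * n + j]
    return c


def sgemm_reference(a, b, m, n, k):
    """Untiled triple loop, used only to check the tiled version."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            c[i * n + j] = sum(a[i * k + p] * b[p * n + j] for p in range(k))
    return c


if __name__ == "__main__":
    m, n, k = 5, 7, 6
    # Small integer-valued inputs keep the floating-point results exact.
    a = [float((i * 3 + 1) % 5) for i in range(m * k)]
    b = [float((i * 2 + 3) % 7) for i in range(k * n)]
    ref = sgemm_reference(a, b, m, n, k)
    for tile in (1, 2, 4, 16):
        print(tile, sgemm_tiled(a, b, m, n, k, tile) == ref)
```

In a real OpenCL program the analogous parameter would typically govern work-group dimensions and how much data each work-item stages in local memory or registers. Which value performs best on a given device is an empirical question that this sketch does not attempt to answer.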

Author information

Authors and Affiliations

Department of Computer Science, University of Chicago, USA
Yao Zhang, Mark Sinclair II & Andrew A. Chien

Editor information

Editors and Affiliations

University of Hamburg, Department of Informatics, Bundesstraße 45a, 20146, Hamburg, Germany
Julian Martin Kunkel
Deutsches Klimarechenzentrum, Bundesstraße 45a, 20146, Hamburg, Germany
Thomas Ludwig
University of Mannheim, Germany and Prometeus GmbH, Fliederstraße 2, 74915, Waibstadt, Germany
Hans Werner Meuer

About this paper

Cite this paper

Zhang, Y., Sinclair, M., Chien, A.A. (2013). Improving Performance Portability in OpenCL Programs. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2013. Lecture Notes in Computer Science, vol 7905. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38750-0_11

DOI: https://doi.org/10.1007/978-3-642-38750-0_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38749-4
Online ISBN: 978-3-642-38750-0
eBook Packages: Computer Science, Computer Science (R0)
