Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs

Huang, Dafei; Wen, Mei; Xun, Changqing; Chen, Dong; Cai, Xing; Qiao, Yuran; Wu, Nan; Zhang, Chunyuan

doi:10.1007/978-3-319-09873-9_18

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8632)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and therefore widely used. However, the locality optimizations written into GPU-specific OpenCL code are usually inherited without analysis, which can hurt CPU performance: on CPUs, local-memory arrays no longer map well onto the hardware, and the associated synchronizations are costly. To resolve this dilemma, we analyze the memory access patterns using array-access descriptors derived from the GPU-specific kernels. The kernels can then be adapted for CPUs by removing all unwanted local-memory arrays together with the barrier statements that become obsolete. Experiments show that the automated transformation can improve OpenCL kernel performance on a Sandy Bridge CPU and on Intel’s Many Integrated Core (MIC) coprocessor.
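The idea in the abstract can be illustrated with a small emulation. The sketch below is not the authors' transformation tool and does not reproduce their array-access-descriptor analysis; it only contrasts, for a hypothetical 3-point stencil, a GPU-style work-group that stages data in a local array and synchronizes at a barrier with a CPU-adapted version that reads global memory directly inside a coarsened loop. The function names `gpu_style` and `cpu_adapted`, the stencil, and the zero boundary values are all assumptions made for illustration, and the input length is assumed to be a multiple of the work-group size.

```python
def gpu_style(src, group_size):
    """Emulate a GPU-specific kernel: stage a tile in a local array,
    synchronize, then compute from the local array."""
    n = len(src)
    dst = [0] * n
    for group in range(n // group_size):
        base = group * group_size
        # Stand-in for a __local array: the tile plus one halo cell per side.
        local = [0] * (group_size + 2)
        # Phase 1: every work-item copies its element; edge items load the halo.
        for lid in range(group_size):
            gid = base + lid
            local[lid + 1] = src[gid]
            if lid == 0:
                local[0] = src[gid - 1] if gid > 0 else 0
            if lid == group_size - 1:
                local[group_size + 1] = src[gid + 1] if gid + 1 < n else 0
        # A barrier(CLK_LOCAL_MEM_FENCE) would separate the two phases here.
        # Phase 2: every work-item computes from the staged tile.
        for lid in range(group_size):
            dst[base + lid] = local[lid] + local[lid + 1] + local[lid + 2]
    return dst


def cpu_adapted(src, group_size):
    """Emulate the adapted kernel: one coarsened loop per work-group,
    no local array and therefore no barrier."""
    n = len(src)
    dst = [0] * n
    for group in range(n // group_size):
        base = group * group_size
        for lid in range(group_size):
            gid = base + lid
            left = src[gid - 1] if gid > 0 else 0
            right = src[gid + 1] if gid + 1 < n else 0
            dst[gid] = left + src[gid] + right
    return dst


if __name__ == "__main__":
    data = list(range(16))
    # Both variants should produce the same result for the same input.
    print(gpu_style(data, 4) == cpu_adapted(data, 4))
```

Because the staged tile is only a copy of a contiguous region of the global array, every read of the local array can be redirected to the corresponding global element, which is what makes both the array and the barrier removable in this toy case.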

Supported by the National Natural Science Foundation of China under Grants No. 61033008, 61272145, and 61103080, and by the 863 Program under Grant No. 2012AA012706.

Author information

Authors and Affiliations

Department of Computer, National University of Defense Technology, China
Dafei Huang, Mei Wen, Changqing Xun, Dong Chen, Yuran Qiao & Chunyuan Zhang
State Key Laboratory of High Performance Computing, Changsha, China
Dafei Huang, Mei Wen, Changqing Xun, Dong Chen, Yuran Qiao, Nan Wu & Chunyuan Zhang
Simula Research Laboratory, Oslo, Norway
Xing Cai & Nan Wu

Editor information

Editors and Affiliations

CRACS/INESC-TEC and FCUP, Universidade do Porto, Rua do Campo Alegre, 1021, 4169-007, Porto, Portugal
Fernando Silva, Inês Dutra & Vítor Santos Costa

About this paper

Cite this paper

Huang, D. et al. (2014). Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs. In: Silva, F., Dutra, I., Santos Costa, V. (eds) Euro-Par 2014 Parallel Processing. Euro-Par 2014. Lecture Notes in Computer Science, vol 8632. Springer, Cham. https://doi.org/10.1007/978-3-319-09873-9_18

DOI: https://doi.org/10.1007/978-3-319-09873-9_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09872-2
Online ISBN: 978-3-319-09873-9
eBook Packages: Computer Science, Computer Science (R0)
