Deriving Efficient Data Movement from Decoupled Access/Execute Specifications

Howes, Lee W.; Lokhmotov, Anton; Donaldson, Alastair F.; Kelly, Paul H. J.

doi:10.1007/978-3-540-92990-1_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5409)

Included in the following conference series:

International Conference on High-Performance Embedded Architectures and Compilers

Abstract

On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the execution schedule of a computation kernel, the compiler or run-time system can derive efficient data movement, even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled Access/Execute specifications, allowing for automatic communication optimisations such as software pipelining and data reuse. We demonstrate the ease and efficiency of programming the Cell Broadband Engine architecture using these classes by implementing a set of benchmarks, which exhibit data reuse and non-affine access functions, and by comparing these implementations against alternative implementations, which use hand-written DMA transfers and software-based caching.
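The central idea of the abstract can be illustrated with a minimal C++ sketch. It is not the authors' framework: the names `StencilAccess`, `Schedule`, `dma_get`, `run_decoupled` and `window_mean` are hypothetical, the "DMA" is an ordinary copy, and the code runs sequentially, so the prefetch only marks where a real transfer would overlap with computation. What it shows is that once the access pattern (a stencil radius) and the execution schedule (a tiled iteration range) are declared separately from the kernel, a generic runtime can derive double-buffered transfers without inspecting the kernel body.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical access descriptor: iteration i reads in[i - radius .. i + radius].
struct StencilAccess {
    std::size_t radius;
};

// Hypothetical execute descriptor: iterations [begin, end), processed in tiles.
struct Schedule {
    std::size_t begin, end, tile;
};

// Stand-in for a DMA get: copies `count` elements starting at `first`
// from "main memory" into a local buffer.
inline void dma_get(std::vector<float>& local, const std::vector<float>& global,
                    std::size_t first, std::size_t count) {
    auto src = global.begin() + static_cast<std::ptrdiff_t>(first);
    local.assign(src, src + static_cast<std::ptrdiff_t>(count));
}

// Example kernel: mean of the window centred on `centre`.
// It only ever touches the local buffer it is handed.
inline float window_mean(const float* centre, std::size_t radius) {
    float sum = 0.0f;
    for (std::size_t k = 0; k <= 2 * radius; ++k) sum += (centre - radius)[k];
    return sum / static_cast<float>(2 * radius + 1);
}

// Derives the transfers from the two descriptors alone and returns how many
// were issued. Two buffers alternate so the next tile can be fetched while
// the current one is processed (sequential here, hence no real overlap).
template <typename Kernel>
std::size_t run_decoupled(const std::vector<float>& in, std::vector<float>& out,
                          StencilAccess acc, Schedule s, Kernel kernel) {
    assert(s.tile > 0 && s.begin >= acc.radius && s.end + acc.radius <= in.size());
    std::vector<float> buf[2];
    std::size_t transfers = 0;
    auto fetch = [&](std::size_t tile_begin, std::vector<float>& dst) {
        std::size_t tile_end = std::min(tile_begin + s.tile, s.end);
        // Footprint of a tile = its iterations plus a halo of `radius` each side.
        dma_get(dst, in, tile_begin - acc.radius,
                (tile_end - tile_begin) + 2 * acc.radius);
        ++transfers;
    };
    if (s.begin >= s.end) return 0;
    int cur = 0;
    fetch(s.begin, buf[cur]);
    for (std::size_t t = s.begin; t < s.end; t += s.tile) {
        std::size_t tile_end = std::min(t + s.tile, s.end);
        if (tile_end < s.end) fetch(tile_end, buf[1 - cur]);  // prefetch next tile
        for (std::size_t i = t; i < tile_end; ++i) {
            // Element i of main memory sits at local index (i - t) + radius.
            out[i] = kernel(&buf[cur][(i - t) + acc.radius], acc.radius);
        }
        cur = 1 - cur;
    }
    return transfers;
}
```

A call such as `run_decoupled(in, out, StencilAccess{1}, Schedule{1, 15, 4}, window_mean)` on a 16-element input issues one transfer per tile. This sketch re-fetches the halo elements shared by neighbouring tiles; exploiting that overlap is the kind of data reuse the abstract refers to.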

Author information

Authors and Affiliations

Department of Computing, Imperial College London, 180 Queen’s Gate, London, SW7 2AZ, UK
Lee W. Howes, Anton Lokhmotov & Paul H. J. Kelly
Codeplay Software, 45 York Place, Edinburgh, EH1 3HP, UK
Alastair F. Donaldson

Editor information

Editors and Affiliations

IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France
André Seznec
Intel Corporation, Massachusetts Microprocessor Design Center, 77 Reed Road, Hudson, MA 01749, USA
Joel Emer
School of Informatics, Institute for Computing Systems Architecture, King’s Buildings, Edinburgh, EH9 3JZ, United Kingdom
Michael O’Boyle
Department of Electrical Engineering, Princeton University, 34 Olden Street, Princeton, NJ 08544-5263, USA
Margaret Martonosi
Department of Computer Science, University of Augsburg, 86135 Augsburg, Germany
Theo Ungerer

About this paper

Cite this paper

Howes, L.W., Lokhmotov, A., Donaldson, A.F., Kelly, P.H.J. (2009). Deriving Efficient Data Movement from Decoupled Access/Execute Specifications. In: Seznec, A., Emer, J., O’Boyle, M., Martonosi, M., Ungerer, T. (eds) High Performance Embedded Architectures and Compilers. HiPEAC 2009. Lecture Notes in Computer Science, vol 5409. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92990-1_14

DOI: https://doi.org/10.1007/978-3-540-92990-1_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-92989-5
Online ISBN: 978-3-540-92990-1
eBook Packages: Computer Science, Computer Science (R0)
