Abstract
On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the execution schedule of a computation kernel, the compiler or run-time system can derive efficient data movement, even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled Access/Execute specifications, allowing for automatic communication optimisations such as software pipelining and data reuse. We demonstrate the ease and efficiency of programming the Cell Broadband Engine architecture using these classes by implementing a set of benchmarks, which exhibit data reuse and non-affine access functions, and by comparing these implementations against alternative implementations, which use hand-written DMA transfers and software-based caching.
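The decoupling idea can be illustrated with a minimal sketch. This is not the framework's actual API: the names `StencilAccess`, `IterationSpace`, `mean_kernel` and `run_decoupled` are hypothetical, and a `std::vector` copy stands in for a DMA transfer into a local store. The toy run-time derives each block's memory footprint from the access and execute descriptors alone, without inspecting the kernel body, and fetches overlapping stencil elements once per block rather than once per iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical access descriptor: iteration i reads input elements
// [i - radius, i + radius] and writes output element i.
struct StencilAccess {
    std::size_t radius;
};

// Hypothetical execute descriptor: a 1D iteration space [begin, end).
struct IterationSpace {
    std::size_t begin;
    std::size_t end;
};

// Kernel body: sees only a local window, never global memory.
// The run-time below treats it as opaque.
inline float mean_kernel(const float* window, std::size_t width) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < width; ++k) sum += window[k];
    return sum / static_cast<float>(width);
}

// Toy run-time: stages data block by block into a local buffer
// (standing in for a DMA "get"), then runs the kernel on the staged copy.
template <typename Kernel>
std::vector<float> run_decoupled(const std::vector<float>& in,
                                 StencilAccess access,
                                 IterationSpace space,
                                 std::size_t block,
                                 Kernel kernel) {
    // Assumed preconditions: non-empty blocks, stencil stays in bounds.
    assert(block > 0);
    assert(space.begin >= access.radius);
    assert(space.end + access.radius <= in.size());

    std::vector<float> out(in.size(), 0.0f);
    const std::size_t width = 2 * access.radius + 1;
    std::vector<float> local;

    for (std::size_t b = space.begin; b < space.end; b += block) {
        const std::size_t e = std::min(b + block, space.end);
        // Footprint of iterations [b, e), derived from the descriptors only.
        const std::size_t lo = b - access.radius;
        const std::size_t hi = e + access.radius;  // exclusive bound
        local.assign(in.begin() + lo, in.begin() + hi);  // simulated transfer
        for (std::size_t i = b; i < e; ++i) {
            // Window for iteration i starts at in[i - radius] == local[i - b].
            out[i] = kernel(&local[i - b], width);
        }
    }
    return out;
}
```

Calling `run_decoupled(in, StencilAccess{1}, IterationSpace{1, 7}, 4, mean_kernel)` on an eight-element input would process the interior in two blocks. Because the footprint of the next block is known before the kernel runs, a real run-time could start that transfer early, which is the opening for software pipelining that the abstract mentions.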
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Howes, L.W., Lokhmotov, A., Donaldson, A.F., Kelly, P.H.J. (2009). Deriving Efficient Data Movement from Decoupled Access/Execute Specifications. In: Seznec, A., Emer, J., O’Boyle, M., Martonosi, M., Ungerer, T. (eds) High Performance Embedded Architectures and Compilers. HiPEAC 2009. Lecture Notes in Computer Science, vol 5409. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92990-1_14
DOI: https://doi.org/10.1007/978-3-540-92990-1_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-92989-5
Online ISBN: 978-3-540-92990-1
eBook Packages: Computer Science, Computer Science (R0)