Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores

Caragea, George C.; Tzannes, Alexandros; Keceli, Fuat; Barua, Rajeev; Vishkin, Uzi

doi:10.1007/s10766-011-0163-8

Published: 01 March 2011

Volume 39, pages 615–638, (2011)

International Journal of Parallel Programming

George C. Caragea¹, Alexandros Tzannes¹, Fuat Keceli², Rajeev Barua²,³ & Uzi Vishkin²,⁴

Abstract

Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4–16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g., GPUs), where per-core power and area budgets favor numerous lighter cores with fewer resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetching increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches the hardware supports and optimizes for that limit. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in average run-time across benchmarks, and the state-of-the-art GCC implementation by up to 34.79%, depending on the hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) the off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration, which delivers a 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design.
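
To make the idea of bounding prefetches by a hardware limit concrete, the following is a minimal C sketch of software loop prefetching with a capped prefetch distance. It is not the RAP algorithm, whose details the abstract does not give. The helper names `prefetch_distance` and `sum_with_prefetch`, their parameters, and the rule of capping the distance at the number of supported outstanding prefetches are assumptions made for illustration only. The only compiler feature used is GCC's `__builtin_prefetch`.

```c
#include <stddef.h>

/* Hypothetical helper (not from the paper): pick how many iterations
 * ahead to prefetch. The uncapped distance is ceil(latency / iteration
 * time); it is then capped at the number of prefetches the hardware is
 * assumed to keep in flight at once. */
static size_t prefetch_distance(size_t miss_latency_cycles,
                                size_t cycles_per_iteration,
                                size_t max_outstanding)
{
    if (cycles_per_iteration == 0 || max_outstanding == 0)
        return 0; /* no usable estimate or no prefetch slots: do not prefetch */

    size_t ideal = (miss_latency_cycles + cycles_per_iteration - 1)
                   / cycles_per_iteration;
    return ideal < max_outstanding ? ideal : max_outstanding;
}

/* Sums an array while prefetching `dist` elements ahead.
 * For simplicity one prefetch is issued per iteration; a production
 * compiler would typically issue one per cache line instead. */
static long sum_with_prefetch(const int *a, size_t n, size_t dist)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (dist > 0 && i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 1); /* read access, low temporal locality */
        sum += a[i];
    }
    return sum;
}
```

With an assumed miss latency of 100 cycles, 10 cycles per iteration, and 4 prefetch slots, `prefetch_distance(100, 10, 4)` returns 4 rather than the uncapped value of 10. The prefetch is only a hint, so the computed sum is the same for any distance.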

Author information

Authors and Affiliations

Department of Computer Science, University of Maryland, College Park, MD, USA
George C. Caragea & Alexandros Tzannes
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA
Fuat Keceli, Rajeev Barua & Uzi Vishkin
Institute for Systems Research, University of Maryland, College Park, MD, USA
Rajeev Barua
Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA
Uzi Vishkin

Corresponding author

Correspondence to George C. Caragea.

About this article

Cite this article

Caragea, G.C., Tzannes, A., Keceli, F. et al. Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores. Int J Parallel Prog 39, 615–638 (2011). https://doi.org/10.1007/s10766-011-0163-8

Received: 20 October 2010
Accepted: 30 December 2010
Published: 01 March 2011
Issue Date: October 2011
DOI: https://doi.org/10.1007/s10766-011-0163-8
