Abstract
Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4–16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g., GPUs), where per-core power and area budgets favor numerous lighter cores with fewer resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetch increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches the hardware supports and optimizes for that limit. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that, depending on the hardware configuration, the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in run-time, averaged across benchmarks, and the state-of-the-art GCC implementation by up to 34.79%. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) the off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration, which delivers a 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design.
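To make the idea of resource awareness concrete, the following is a minimal sketch in C. It is not the RAP algorithm itself, which this abstract does not detail. It only illustrates one way a prefetch distance could be capped by a hardware limit on outstanding prefetches. The helper names `prefetch_distance` and `sum_with_prefetch`, their parameters, and the simple latency model are all assumptions made for illustration; `__builtin_prefetch` is the standard GCC builtin.

```c
#include <stddef.h>

/* Hypothetical helper (not from the paper): choose how many iterations
 * ahead to prefetch. The ideal distance covers the memory latency; the
 * cap keeps the number of in-flight prefetches within what the hardware
 * can track. */
static size_t prefetch_distance(size_t latency_cycles,
                                size_t cycles_per_iter,
                                size_t streams,
                                size_t max_outstanding)
{
    if (cycles_per_iter == 0) cycles_per_iter = 1;
    if (streams == 0) streams = 1;

    /* Iterations needed to hide the latency: ceil(latency / iteration time). */
    size_t ideal = (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;

    /* Each iteration of look-ahead keeps `streams` prefetches in flight,
     * so the look-ahead cannot exceed max_outstanding / streams. */
    size_t cap = max_outstanding / streams;

    return ideal < cap ? ideal : cap;
}

/* Illustrative loop: sum an array while prefetching `dist` elements ahead. */
static long sum_with_prefetch(const int *a, size_t n, size_t dist)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (dist > 0 && i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 3); /* read, high locality */
        sum += a[i];
    }
    return sum;
}
```

Under these assumptions, a loop with two array streams on hardware that tracks eight outstanding prefetches would look ahead at most four iterations, even if the memory latency alone would call for a larger distance.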
Cite this article
Caragea, G.C., Tzannes, A., Keceli, F. et al. Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores. Int J Parallel Prog 39, 615–638 (2011). https://doi.org/10.1007/s10766-011-0163-8
DOI: https://doi.org/10.1007/s10766-011-0163-8