Abstract
Optimizations, including tiling, often target a single level of memory or parallelism, such as cache. These optimizations usually operate level by level, guided by a cost function parameterized by features of that single level. The benefit of optimizations guided by these one-level cost functions decreases as architectures move toward hierarchies of memory and of parallelism. We have identified three common architectural scenarios in which a single tiling choice could be improved by using information from multiple levels in concert. For each scenario, we derive multi-level cost functions that guide the optimal choice of tile size and shape, and we quantify the improvement gained. We give both analytical and simulation results to support these claims.
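As a point of reference for readers new to tiling, the following is a minimal sketch of a tiled matrix multiply together with a toy one-level tile-size rule. It is not taken from the article: the function names, the square tile shape, and the "three tiles must fit in cache" footprint rule are all illustrative assumptions, and the multi-level cost functions the abstract refers to are not reproduced here.

```python
import math


def largest_tile_that_fits(capacity_elems):
    """Toy one-level rule (an assumption, not the article's cost function):
    pick the largest square tile side t such that three t x t tiles
    (one each for A, B and C) fit in a cache holding capacity_elems elements."""
    return max(1, math.isqrt(capacity_elems // 3))


def matmul_naive(a, b, n):
    """Reference n x n matrix multiply with the usual i, j, k loop order."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """The same computation with the iteration space cut into tile x tile x tile
    blocks, so each block reuses a small working set before moving on."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # tile loops walk over blocks
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # element loops stay inside one block; min() handles ragged edges
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

The rule above looks at one cache level only, which is the kind of one-level guidance the abstract argues loses value on hierarchical machines; a multi-level approach would instead choose the tile size and shape using parameters from several levels at once.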
Cite this article
Mitchell, N., Högstedt, K., Carter, L. et al. Quantifying the Multi-Level Nature of Tiling Interactions. International Journal of Parallel Programming 26, 641–670 (1998). https://doi.org/10.1023/A:1018782528453