Abstract
On chip multiprocessors (CMPs), multiple cores commonly share certain levels of cache. This sharing increases contention for cache space and memory-to-chip bandwidth, further highlighting the importance of data locality analysis.
As a rigorous and hardware-independent locality metric, reuse distance has supported a variety of locality analyses, program transformations, and performance predictions. However, previous studies have concentrated on sequential programs running on unicore processors. On a CMP, accesses by different threads (or jobs) interact in the shared cache. How reuse distance applies to the new architecture remains an open question: in particular, how the interactions in the shared cache affect the collection and application of reuse distance, and how reuse-distance-based locality analysis should adapt to such architectural changes.
This paper presents our explorations toward answering those questions. It first introduces the concept of concurrent reuse distance, a direct extension of traditional reuse distance that considers the data references of all co-running threads (or jobs). It then discusses the properties of concurrent reuse distance, revealing the special challenges facing its collection and application on CMP platforms. Finally, it presents solutions to those challenges for a class of multithreaded applications. The solutions center on a probabilistic model that connects concurrent reuse distance with the data locality of each individual thread. Experiments demonstrate the effectiveness of the proposed techniques in facilitating the use of concurrent reuse distance for CMP computing.
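To make the distinction concrete, the following is a minimal sketch, not the paper's measurement tool or its probabilistic model. It assumes the usual definition of reuse distance (the number of distinct data elements accessed between two consecutive references to the same element) and treats concurrent reuse distance as the same quantity measured over the merged access stream of all threads. The function names, the toy traces, and the strict round-robin interleaving are all hypothetical choices made for illustration; real interleavings depend on thread timing.

```python
from typing import Hashable, List, Optional, Sequence, Tuple


def reuse_distances(trace: Sequence[Hashable]) -> List[Optional[int]]:
    """For each access, count the distinct elements touched since the
    previous access to the same element (None for a first access).
    Naive O(n^2) version, chosen for clarity rather than speed."""
    last_seen = {}
    result: List[Optional[int]] = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1 : i]
            result.append(len(set(between)))
        else:
            result.append(None)
        last_seen[addr] = i
    return result


def round_robin(*traces: Sequence[Hashable]) -> List[Tuple[int, Hashable]]:
    """Hypothetical interleaving: one access per thread per step.
    Assumes equally long traces; extra accesses would be dropped by zip."""
    merged = []
    for step in zip(*traces):
        for tid, addr in enumerate(step):
            merged.append((tid, addr))
    return merged


def concurrent_reuse_distances(
    interleaved: Sequence[Tuple[int, Hashable]]
) -> List[Optional[int]]:
    """Reuse distances over the merged stream of all threads. Thread ids
    are ignored because the threads are assumed to share one cache."""
    return reuse_distances([addr for _tid, addr in interleaved])


thread0 = ["a", "b", "a"]   # toy trace, assumed for illustration
thread1 = ["x", "y", "z"]   # co-runner touching disjoint data

standalone = reuse_distances(thread0)
concurrent = concurrent_reuse_distances(round_robin(thread0, thread1))
print(standalone)   # [None, None, 1]
print(concurrent)   # [None, None, None, None, 3, None]
```

In this toy case the reuse of `a` has distance 1 when thread 0 runs alone but distance 3 in the merged stream, because the co-runner's accesses to `x` and `y` fall between the two references. The concurrent value depends on the interleaving, which is one reason it is harder to collect than the per-thread value.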
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jiang, Y., Zhang, E.Z., Tian, K., Shen, X. (2010). Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?. In: Gupta, R. (eds) Compiler Construction. CC 2010. Lecture Notes in Computer Science, vol 6011. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11970-5_15
DOI: https://doi.org/10.1007/978-3-642-11970-5_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11969-9
Online ISBN: 978-3-642-11970-5