Abstract
On cc-NUMA multiprocessors, the non-uniformity of main memory latencies motivates the co-location of threads and data. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement (AMR). The solver is parallelized using OpenMP, and the adaptive mesh refinement makes dynamic load balancing necessary. Because the runtime adaptation causes the memory access pattern to change dynamically, achieving a high degree of geographical locality is challenging. The main conclusions of the study are (1) that geographical locality is very important for the performance of the solver, (2) that the performance can be improved significantly using dynamic page migration of misplaced data, (3) that a migrate-on-next-touch directive works well, whereas the first-touch strategy is less advantageous for programs exhibiting dynamically changing memory access patterns, and (4) that the overhead for such migration is low compared to the total execution time.
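The difference between the two placement strategies named in conclusion (3) can be illustrated with a toy model. The sketch below is not the solver studied in the article and does not call any operating system interface; the function names, the two-node machine, the eight pages, and the three partitionings are all assumptions made for illustration. It counts how many page accesses are remote when the thread-to-data assignment changes between phases, as it would after a repartitioning of the adaptive mesh, and how many pages are moved. The cost of each migration is not modeled.

```python
"""Toy model of page placement on a two-node cc-NUMA machine (illustrative only)."""


def partition(n_pages, split):
    """Hypothetical partitioning: pages below `split` are accessed by node 0, the rest by node 1."""
    return {page: (0 if page < split else 1) for page in range(n_pages)}


def simulate(phases, policy):
    """Return (remote_accesses, migrations) for a sequence of access phases.

    phases: list of dicts mapping page id -> node that accesses the page in that phase.
    policy: "first-touch" or "migrate-on-next-touch".
    """
    if policy not in ("first-touch", "migrate-on-next-touch"):
        raise ValueError(f"unknown policy: {policy}")

    home = {}        # page id -> node whose local memory currently holds the page
    remote = 0
    migrations = 0

    for index, phase in enumerate(phases):
        # Assumption: the migrate-on-next-touch directive is issued once after
        # every repartitioning, i.e. before each phase except the first, and
        # marks every page that has already been placed.
        if policy == "migrate-on-next-touch" and index > 0:
            marked = set(home)
        else:
            marked = set()

        for page, node in phase.items():
            if page not in home:
                # First touch: the page is placed on the node that touches it.
                home[page] = node
            elif page in marked:
                # Next touch after the directive: move the page if it is misplaced.
                if home[page] != node:
                    home[page] = node
                    migrations += 1
                marked.discard(page)
            if home[page] != node:
                remote += 1

    return remote, migrations


# Three phases with a shifting partition boundary, standing in for load rebalancing.
PHASES = [partition(8, 4), partition(8, 6), partition(8, 2)]

if __name__ == "__main__":
    for policy in ("first-touch", "migrate-on-next-touch"):
        remote, migrations = simulate(PHASES, policy)
        print(f"{policy}: remote accesses = {remote}, migrations = {migrations}")
```

In this model, first-touch leaves every page where the initial partitioning put it, so each later shift of the partition boundary produces remote accesses, while migrate-on-next-touch trades those remote accesses for page moves.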
Cite this article
Nordén, M., Löf, H., Rantakokko, J. et al. Dynamic Data Migration for Structured AMR Solvers. Int J Parallel Prog 35, 477–491 (2007). https://doi.org/10.1007/s10766-007-0056-z
DOI: https://doi.org/10.1007/s10766-007-0056-z