Abstract
Domain decomposition for regular meshes on parallel computers has traditionally been performed by attempting to partition the work exactly among the available processors (now cores). However, these strategies often do not account for inherent system noise, which can hinder the scalability of MPI applications on emerging petascale machines with 10,000+ nodes. In this work, we propose a tunable hybrid static/dynamic scheduling strategy that can be incorporated into existing MPI implementations of mesh codes. Applying this strategy to a 3D Jacobi algorithm, we achieve performance gains of at least 16% on 64 SMP nodes.
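The abstract does not give the scheduler itself, so the following is only a minimal sketch of what a tunable hybrid static/dynamic split could look like within one node, not the authors' implementation. The function name `hybrid_schedule` and the tuning parameter `static_fraction` are hypothetical, and Python threads stand in for the cores of an SMP node. Each worker first processes an equal, pre-assigned block of tasks (the static part), then pulls leftover tasks from a shared queue (the dynamic part), so a worker delayed by noise simply takes fewer of the leftover tasks.

```python
import threading
from collections import deque


def hybrid_schedule(num_tasks, num_workers, static_fraction, work):
    """Run work(i) for every i in range(num_tasks) with a hybrid split.

    static_fraction is a hypothetical tuning knob in [0, 1]: that share of
    the tasks is pre-assigned in equal contiguous blocks, and the remaining
    tasks are handed out on demand from a shared queue.
    Returns, per worker, the list of task indices it executed.
    """
    if not 0.0 <= static_fraction <= 1.0:
        raise ValueError("static_fraction must be between 0 and 1")

    # Round the static share down so every worker gets an equal block.
    per_worker = int(num_tasks * static_fraction) // num_workers
    num_static = per_worker * num_workers

    # Everything not statically assigned goes into the shared queue.
    dynamic = deque(range(num_static, num_tasks))
    lock = threading.Lock()
    done_by = [[] for _ in range(num_workers)]

    def worker(w):
        # Static phase: a fixed contiguous block, no synchronization needed.
        for i in range(w * per_worker, (w + 1) * per_worker):
            work(i)
            done_by[w].append(i)
        # Dynamic phase: take one leftover task at a time until none remain.
        while True:
            with lock:
                if not dynamic:
                    return
                i = dynamic.popleft()
            work(i)
            done_by[w].append(i)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done_by
```

In this sketch, `static_fraction=1.0` reduces to a purely static partition and `static_fraction=0.0` to a purely dynamic one. Intermediate values trade the low overhead of fixed blocks against the ability to absorb delays on individual workers.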
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kale, V., Gropp, W. (2010). Load Balancing for Regular Meshes on SMPs with MPI. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2010. Lecture Notes in Computer Science, vol 6305. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15646-5_24
DOI: https://doi.org/10.1007/978-3-642-15646-5_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15645-8
Online ISBN: 978-3-642-15646-5
eBook Packages: Computer Science (R0)