Abstract
Most MPI checkpointers depend substantially, or even entirely, on a specific MPI implementation or platform, which limits their portability. In this paper we present design and implementation issues that affect the portability of an MPI checkpointer, together with solutions to them. We developed PC/MPI (Portable Checkpointer for MPI), which applies these solutions, and verified that it works with various MPI implementations and platforms.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Ahn, S., Kim, J., Han, S. (2003). PC/MPI: Design and Implementation of a Portable MPI Checkpointer. In: Dongarra, J., Laforenza, D., Orlando, S. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2003. Lecture Notes in Computer Science, vol 2840. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39924-7_43
DOI: https://doi.org/10.1007/978-3-540-39924-7_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20149-6
Online ISBN: 978-3-540-39924-7
eBook Packages: Springer Book Archive