Abstract
Because parallel cluster systems consist of large numbers of computing nodes, they face a high risk of individual node failure. To overcome the high overhead of current fault-tolerant MPI systems, this paper presents TH-MPI, a fault-tolerant MPI for parallel cluster systems. Because it is integrated into the Linux kernel, TH-MPI is implemented in a more effective, transparent, and extensive way. With the support of dynamic kernel modules and diskless checkpointing, our experiment shows that checkpointing in TH-MPI is effectively optimized.
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, Y., Fang, Q., Du, Z., Li, S. (2001). TH-MPI: OS Kernel Integrated Fault Tolerant MPI. In: Cotronis, Y., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2001. Lecture Notes in Computer Science, vol 2131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45417-9_15
DOI: https://doi.org/10.1007/3-540-45417-9_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42609-7
Online ISBN: 978-3-540-45417-5
eBook Packages: Springer Book Archive