Abstract
Embedded high-performance computing is increasingly called upon to provide critical computing resources. The ability to tolerate faults during operation, both maintaining operational capability and ensuring that correct results continue to be produced, is an important ingredient in mission-critical systems. An architecture for such a system is proposed that withstands faults with graceful degradation in performance and complete transparency to the applications programmer. The final system will offer fault-tolerant computing transparently to MPI applications and will draw heavily on existing, demonstrated successes.
This work was funded in part by NSF Grant No. EEC-8907070 Amendment 021 and by ONR Grant No. N00014-97-1-0116.
© 1998 Springer-Verlag Berlin Heidelberg
Cite this paper
Russ, S.H. (1998). An architecture for rapid distributed fault tolerance. In: Rolim, J. (eds) Parallel and Distributed Processing. IPPS 1998. Lecture Notes in Computer Science, vol 1388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64359-1_757
DOI: https://doi.org/10.1007/3-540-64359-1_757
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64359-3
Online ISBN: 978-3-540-69756-5
eBook Packages: Springer Book Archive