An architecture for rapid distributed fault tolerance

Russ, Samuel H.

doi:10.1007/3-540-64359-1_757

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1388)

Included in the following conference series:

International Parallel Processing Symposium

Abstract

Embedded high performance computing is being called upon to provide critical computing resources with increasing frequency. The ability to tolerate faults during operation, both maintaining operational capability and ensuring that correct results continue to be produced, is an important ingredient in mission-critical systems. An architecture for such a system is proposed, providing the ability to withstand faults with graceful degradation in performance and complete transparency to the applications programmer. The final system will be able to offer fault-tolerant computing transparently to MPI applications and draws heavily on existing, demonstrated successes.

This work was funded in part by NSF Grant No. EEC-8907070 Amendment 021 and by ONR Grant No. N00014-97-1-0116.

Bibliography

Samuel H. Russ, Brian Flachs, Jonathan Robinson, and Bjorn Heckel, “Hector: Automated Task Allocation for MPI”, Proceedings of the 10th International Parallel Processing Symposium, Honolulu, HI, April 1996.
Guerraoui, R., and Schiper, A., “Software-based Replication for Fault Tolerance”, Computer, Vol. 30, No. 4, April 1997, pp. 68–74.
Jonathan Robinson, Samuel H. Russ, Brian Flachs, and Bjorn Heckel, “A Task Migration Implementation for the Message-Passing Interface”, Proceedings of the IEEE 5th High Performance Distributed Computing Conference (HPDC-5), Syracuse, NY, August 1996.
Dr. Samuel H. Russ, “Using Hector in an Architecture for Rapid Distributed Fault Tolerance”, MSU Technical Report No. MSSU-EIRS-ERC-97-17, December 1997.
Dr. Samuel H. Russ, Brad Meyers, Chun-Heong Tan, and Bjorn Heckel, “User-Transparent Run-time Performance Optimization”, 2nd International Workshop on Embedded High Performance Computing, associated with IPPS '97, Geneva, Switzerland, April 1997.

Author information

Authors and Affiliations

NSF Engineering Research Center for Computational Field Simulation, Mississippi State University, USA
Dr. Samuel H. Russ

Editor information

José Rolim

About this paper

Cite this paper

Russ, S.H. (1998). An architecture for rapid distributed fault tolerance. In: Rolim, J. (eds) Parallel and Distributed Processing. IPPS 1998. Lecture Notes in Computer Science, vol 1388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64359-1_757

DOI: https://doi.org/10.1007/3-540-64359-1_757
Published: 08 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64359-3
Online ISBN: 978-3-540-69756-5
eBook Packages: Springer Book Archive
