Optimization of checkpointing-related I/O for high-performance parallel and distributed computing

Subramaniyan, Rajagopal; Grobelny, Eric; Studham, Scott; George, Alan D.

doi:10.1007/s11227-007-0162-0

Published: 15 December 2007

Volume 46, pages 150–180, (2008)

The Journal of Supercomputing

Rajagopal Subramaniyan¹,
Eric Grobelny¹,
Scott Studham² &
Alan D. George¹

Abstract

Checkpointing, the process of saving program or application state, usually to stable storage, has been the most common fault-tolerance methodology for high-performance applications. The rate of checkpointing (how often a checkpoint is taken) is driven primarily by the failure rate of the system. A low checkpointing rate consumes fewer resources but increases the chance of high computational loss; a high rate does the reverse. It is important to strike a balance, and an optimum rate of checkpointing is required. In this paper, we analytically model the process of checkpointing in terms of the mean time between failures of the system, the amount of memory being checkpointed, the sustainable I/O bandwidth to stable storage, and the frequency of checkpointing. We identify the optimum frequency of checkpointing for systems with given specifications, thereby enabling efficient use of available resources and maximum system performance without compromising fault tolerance. Further, we develop discrete-event models that simulate the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum rate of checkpointing for systems of varying resource levels, ranging from small embedded cluster systems to large supercomputers.
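
The trade-off described in the abstract can be illustrated with a small sketch. This is not the analytical model developed in the paper, which is not reproduced on this page. Instead it uses the well-known first-order approximation usually attributed to Young, in which the checkpoint interval that minimizes overhead is roughly the square root of twice the checkpoint write time multiplied by the mean time between failures. The function names, the overhead estimate, and the example system figures are all assumptions chosen for illustration.

```python
import math


def checkpoint_write_time(memory_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to write one checkpoint: memory checkpointed / sustainable I/O bandwidth."""
    if memory_bytes < 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("memory must be >= 0 and bandwidth must be > 0")
    return memory_bytes / bandwidth_bytes_per_s


def young_interval(write_time_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimum checkpoint interval.

    Assumption: this stands in for the paper's own model, which is not shown here.
    """
    if write_time_s < 0 or mtbf_s <= 0:
        raise ValueError("write time must be >= 0 and MTBF must be > 0")
    return math.sqrt(2.0 * write_time_s * mtbf_s)


def overhead_fraction(interval_s: float, write_time_s: float, mtbf_s: float) -> float:
    """First-order estimate of the fraction of time lost to checkpointing.

    The first term is the cost of writing checkpoints; the second is the
    expected rework after a failure (about half an interval per failure).
    """
    return write_time_s / interval_s + interval_s / (2.0 * mtbf_s)


# Hypothetical system: 1 TB checkpointed at 10 GB/s, with a 24-hour MTBF.
write_time = checkpoint_write_time(1e12, 1e10)   # 100 seconds per checkpoint
mtbf = 24 * 3600.0
tau = young_interval(write_time, mtbf)

if __name__ == "__main__":
    print(f"write time: {write_time:.0f} s")
    print(f"approximate optimum interval: {tau:.0f} s")
    for candidate in (tau / 4, tau / 2, tau, 2 * tau, 4 * tau):
        print(f"interval {candidate:8.0f} s -> overhead "
              f"{overhead_fraction(candidate, write_time, mtbf):.4f}")
```

Under these assumptions, checkpointing much more often than the approximate optimum wastes time on I/O, and checkpointing much less often wastes time on recomputation after a failure, which is the balance the abstract refers to.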

Author information

Authors and Affiliations

High-performance Computing and Simulation (HCS) Research Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, 32611-6200, USA
Rajagopal Subramaniyan, Eric Grobelny & Alan D. George
National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, 37831-6006, USA
Scott Studham

Corresponding author

Correspondence to Rajagopal Subramaniyan.

Additional information

An earlier version of this paper appeared in Proceedings of the 2006 International Conference on Parallel and Distributed Processing Techniques and Applications, June 2006.

About this article

Cite this article

Subramaniyan, R., Grobelny, E., Studham, S. et al. Optimization of checkpointing-related I/O for high-performance parallel and distributed computing. J Supercomput 46, 150–180 (2008). https://doi.org/10.1007/s11227-007-0162-0

Received: 25 September 2007
Accepted: 21 November 2007
Published: 15 December 2007
Issue Date: November 2008
DOI: https://doi.org/10.1007/s11227-007-0162-0
