Abstract
Checkpointing, the process of saving program or application state, usually to stable storage, has been the most common fault-tolerance methodology for high-performance applications. The rate of checkpointing (how often a checkpoint is taken) is primarily driven by the failure rate of the system. A low checkpointing rate consumes fewer resources but increases the risk of losing a large amount of computation when a failure occurs; a high rate reduces that risk at the cost of greater resource consumption. An optimum checkpointing rate that balances the two is therefore required. In this paper, we analytically model the process of checkpointing in terms of the mean time between failures of the system, the amount of memory being checkpointed, the sustainable I/O bandwidth to the stable storage, and the frequency of checkpointing. We identify the optimum frequency of checkpointing for systems with given specifications, thereby enabling efficient use of available resources and maximum system performance without compromising fault tolerance. Further, we develop discrete-event models that simulate the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum rate of checkpointing for systems of varying resource levels, ranging from small embedded cluster systems to large supercomputers.
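The trade-off described in the abstract can be illustrated with a short sketch. The paper's own analytical model is not reproduced on this page, so the code below substitutes Young's classical first-order approximation, in which the checkpoint interval is sqrt(2 × checkpoint cost × MTBF). It assumes the checkpoint cost is simply the checkpointed memory divided by the sustainable I/O bandwidth. All function and parameter names are hypothetical and chosen only for illustration.

```python
import math


def checkpoint_cost_s(memory_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to write one checkpoint, assuming cost = memory / bandwidth."""
    return memory_gb / bandwidth_gb_per_s


def young_interval_s(cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    This is a stand-in for illustration, not the model derived in the paper.
    """
    return math.sqrt(2.0 * cost_s * mtbf_s)


def overhead_fraction(interval_s: float, cost_s: float, mtbf_s: float) -> float:
    """First-order overhead estimate for a given checkpoint interval.

    cost_s / interval_s      -> time spent writing checkpoints
    interval_s / (2 * mtbf_s) -> expected rework after a failure
    """
    return cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Assumed example system: 1000 GB of state, 10 GB/s to storage, 24 h MTBF.
    cost = checkpoint_cost_s(memory_gb=1000.0, bandwidth_gb_per_s=10.0)
    mtbf = 24 * 3600.0
    best = young_interval_s(cost, mtbf)
    for label, interval in (("half", best / 2), ("approx. optimum", best), ("double", best * 2)):
        print(f"{label:>16}: interval={interval:8.0f} s, "
              f"overhead={overhead_fraction(interval, cost, mtbf):.4f}")
```

Under these assumptions, checkpointing either more or less often than the approximate optimum raises the estimated overhead, which is the balance the abstract refers to.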
Additional information
An earlier version of this paper appeared in Proceedings of the 2006 International Conference on Parallel and Distributed Processing Techniques and Applications, June 2006.
Cite this article
Subramaniyan, R., Grobelny, E., Studham, S. et al. Optimization of checkpointing-related I/O for high-performance parallel and distributed computing. J Supercomput 46, 150–180 (2008). https://doi.org/10.1007/s11227-007-0162-0
DOI: https://doi.org/10.1007/s11227-007-0162-0