Abstract
The large scale of current and next-generation massively parallel processing (MPP) systems presents significant fault-tolerance challenges. For applications that perform periodic checkpointing, the choice of the checkpoint interval, the period between checkpoints, can have a significant impact on both the execution time of the application and the number of checkpoint I/O operations it performs. Together, these two metrics determine the frequency of checkpoint I/O operations and, thereby, the contribution of checkpointing to the demand the application places on the I/O bandwidth of the computing system. Finding the optimal checkpoint interval, the one that minimizes wall clock execution time, has been a subject of research over the last decade. In this paper, we present a simple, elegant, and accurate analytical model of a complementary performance metric: the expected aggregate number of checkpoint I/O operations. We also present simulation studies that validate the analytical model. Insights from a mathematical analysis of this model, combined with existing models of wall clock execution time, help application programmers make a well-informed choice of checkpoint interval that represents an appropriate trade-off between execution time and the number of checkpoint I/O operations. We illustrate the existence of such propitious checkpoint intervals using parameters of four MPP systems: SNL’s Red Storm, ORNL’s Jaguar, LLNL’s Blue Gene/L (BG/L), and a theoretical Petaflop system.
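To make the trade-off concrete, the following minimal sketch computes a checkpoint interval with Young's classical first-order approximation, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the mean time between failures. It then counts the checkpoints a failure-free run would write at that interval. This is not the analytical model presented in the paper, which the abstract does not reproduce. The function names and the sample parameter values are hypothetical, chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the checkpoint interval.

    tau = sqrt(2 * delta * M), with delta the checkpoint write time and
    M the mean time between failures, both in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def failure_free_checkpoints(solve_time_s: float, interval_s: float) -> int:
    """Checkpoints written if no failure occurs (an illustrative
    simplification: restarts and interrupted checkpoints are ignored)."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return math.floor(solve_time_s / interval_s)


if __name__ == "__main__":
    # Hypothetical values, not taken from any of the four systems.
    delta = 50.0         # seconds to write one checkpoint
    mtbf = 10_000.0      # seconds between failures, on average
    solve = 100_000.0    # seconds of useful computation

    tau = young_interval(delta, mtbf)
    for factor in (1.0, 1.5, 2.0):
        interval = factor * tau
        count = failure_free_checkpoints(solve, interval)
        print(f"interval = {interval:8.1f} s -> {count} checkpoints")
```

Lengthening the interval beyond the execution-time optimum reduces the number of checkpoint I/O operations, which is the kind of trade-off the paper's model is meant to quantify once failures are accounted for.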
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Arunagiri, S., Daly, J.T., Teller, P.J. (2009). Modeling and Analysis of Checkpoint I/O Operations. In: Al-Begain, K., Fiems, D., Horváth, G. (eds) Analytical and Stochastic Modeling Techniques and Applications. ASMTA 2009. Lecture Notes in Computer Science, vol 5513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02205-0_27
DOI: https://doi.org/10.1007/978-3-642-02205-0_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02204-3
Online ISBN: 978-3-642-02205-0
eBook Packages: Computer Science (R0)