Abstract
The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large-scale computing platforms. This work is driving enhancements in programming environments, in particular research on extending message passing libraries with fault-tolerant computing capabilities. The community has also recognized that tools for resilience experimentation are greatly lacking. We argue, however, that there are several parallels between “performance tools” and “resilience tools”. As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community.
In this paper, we describe our initial motivation for leveraging standard HPC performance analysis techniques to develop diagnostic tools that assist fault tolerance experiments for HPC applications. These diagnostic procedures provide context about the state of the system when the errors (failures) occurred. We describe our initial work in leveraging an MPI performance trace tool to provide global context during fault injection experiments. Such tools will assist the HPC resilience community as it extends existing application codes, and writes new ones, to support fault tolerance.
The rights of this work are transferred to the extent transferable according to title 17 U.S.C. 105.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Naughton, T., Böhm, S., Engelmann, C., Vallée, G. (2014). Using Performance Tools to Support Experiments in HPC Resilience. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_71
DOI: https://doi.org/10.1007/978-3-642-54420-0_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science, Computer Science (R0)