Using Performance Tools to Support Experiments in HPC Resilience

Naughton, Thomas; Böhm, Swen; Engelmann, Christian; Vallée, Geoffroy

doi:10.1007/978-3-642-54420-0_71

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8374)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault tolerant computing capabilities. The community has also recognized that tools for resilience experimentation are greatly lacking. However, we argue that there are several parallels between “performance tools” and “resilience tools”. As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community.

In this paper, we describe the initial motivation to leverage standard HPC performance analysis techniques to aid in developing diagnostic tools to assist fault tolerance experiments for HPC applications. These diagnosis procedures help to provide context for the system when the errors (failures) occurred. We describe our initial work in leveraging an MPI performance trace tool to assist in providing global context during fault injection experiments. Such tools will assist the HPC resilience community as they extend existing and new application codes to support fault tolerance.

The rights of this work are transferred to the extent transferable according to title 17 U.S.C. 105.

Author information

Authors and Affiliations

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Thomas Naughton, Swen Böhm, Christian Engelmann & Geoffroy Vallée
School of Systems Engineering, The University of Reading, Reading, UK
Thomas Naughton

Editor information

Editors and Affiliations

Rechen- und Kommunikationszentrum, RWTH Aachen, Seffenter Weg 23, 52074, Aachen, Germany
Dieter an Mey
TU Vienna, 1040, Vienna, Austria
Michael Alexander
RWTH Aachen University, Seffenter Weg 23, 52074, Aachen, Germany
Paolo Bientinesi & Carsten Clauss
University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
Mario Cannataro
Inria Rennes - Bretagne Atlantique, 35042, Rennes, France
Alexandru Costan & Christine Morin
University of Innsbruck, 6020, Innsbruck, Austria
Gabor Kecskemeti
Department of Computer Science, University of Pisa, 56126, Pisa, Italy
Laura Ricci
Universitat Politècnica de València, 46022, València, Spain
Julio Sahuquillo
LLNL, USA
Martin Schulz
Dipartimento di Informatica, Università di Salerno, 84084, Salerno, Italy
Vittorio Scarano
Tennessee Tech University and Oak Ridge National Laboratory, 38505, Cookeville, TN, USA
Stephen L. Scott
Technische Universität München, 80333, Munich, Germany
Josef Weidendorfer

Cite this paper

Naughton, T., Böhm, S., Engelmann, C., Vallée, G. (2014). Using Performance Tools to Support Experiments in HPC Resilience. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_71

DOI: https://doi.org/10.1007/978-3-642-54420-0_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science, Computer Science (R0)
