Abstract
Future Exascale computing systems with a large number of processors, memory elements and interconnection links, are expected to experience multiple, complex faults, which affect both applications and operating-runtime systems. A variety of algorithms, frameworks and tools are being proposed to realize and/or verify the resilience properties of computations that guarantee correct results on failure-prone computing systems. We analytically show that certain resilient computation problems in presence of general classes of faults are undecidable, that is, no algorithms exist for solving them. We first show that the membership verification in a generic set of resilient computations is undecidable. We describe classes of faults that can create infinite loops or non-halting computations, whose detection in general is undecidable. We then show certain resilient computation problems to be undecidable by using reductions from the loop detection and halting problems under two formulations, namely, an abstract programming language and Turing machines, respectively. These two reductions highlight different failure effects: the former represents program and data corruption, and the latter illustrates incorrect program execution. These results call for broad-based, well-characterized resilience approaches that complement purely computational solutions using methods such as hardware monitors, co-designs, and system- and application-specific diagnosis codes.
Chapter PDF
Similar content being viewed by others
References
Cappello, F.: Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. Journal of High Performance Computing Applications 23(3), 212–226 (2009)
Cappello, F., Geist, A., Gropp, B., Kale, S., Kramer, B., Snir, M.: Towards exascale resilience. Journal of High Performance Computing Applications 23(4), 374–388 (2011)
Carbin, M., Misailovic, S., Rinard, M.C.: Verifying quantitaive realiability for programs that execute on unreliable hardware. In: Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA (2013)
Chaitin, G.J.: Information, Randomness and Incompleteness, 2nd edn. World Scientific Pub. (1990)
Chen, Z.: Online-abft: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In: ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (2013)
Cohen, D.I.A.: Inroduction to Computer Theory. John Wiley and Sons, Inc (1986)
Cohen, F.B.: Computational aspects of computer virus. Computer & Security 8, 325–344 (1989)
Davies, M.D., Weyuker, E.J.: Computability, Complexity, and Languages. Academic Press, Inc. (1983)
Davies, T., Chen, X.: Correcting soft errors online in lu factorization. In: Symposium on High-Performance Parallel and Distributed Computing (2013)
de Kruijif, M., Nomura, S., Sankaralingam, K.: Relax: An architectural framework for software recovery of hardware faults. In: International Symposium on Computer Architecture, ISCA (2010)
Dongarra, J.: P Beckman, and et al. The international exascale software roadmap. International Journal of High Performance Computer Applications 25(1) (2011)
Erez, M., Jayasena, N., Knight, T.J., Dally, W.J.: Fault tolerance techniques for the merrimac streaming supercomputer. In: International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2005)
Fujiwara, H., Toida, S.: The complexity of fault detection problems for combinational logic circuits. IEEE Trans. on Computers C-31(6), 553–560 (1982)
Godel, K.: On formally undecidable propositions of principia mathematica and related systems i. Monatshefte fur Math. und Physik 38, 173–198 (1992); Englishe translation by Meltzer, B., published by Dover Publications, Inc. (1992)
HPL - a portable implementation of the high-performance linpack benchmark for distributed-memory computers, http://www.netlib.org/benchmark/hpl
Huang, Y., Kintala, C.: Software fault tolerance of the application layer. In: Lyu, M.R. (ed.) Software Fault Tolerance, pp. 231–248 (1995)
Jia, Y., Luszczek, P., Bosilca, G., Dongarra, J.: Cpu-gpu hybrid bidiagonal reduction with soft error resilience. In: Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA (2013)
Li, D., Chen, Z., Wu, P., Vetter, J.S.: Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach. In: ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (2013)
Li, M., Ramachandran, P., Sahoo, S.K., Adve, S.V., Adve, V.S., Zhou, Y.: Understanding the propagation of hard errors to software and implications for resilient system design. In: Architectural Support for Programming Languages and Operating Systems, ASPLOS (2008)
Lu, C., Reed, D.A.: Assessing fault sensitivity in mpi applications. In: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (2004)
Rao, N.S.V.: Fault detection in multi-core processors using chaotic maps. In: 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS 2013 (2013)
Sahoo, S.K., Li, M.-L., Ramachandran, P., Adve, S.V., Adve, V.S., Zhou, Y.: Using likely program invariants to detect hardware errors. In: International Conf. on Dependable Systems and Networks (2008)
Turing, A.N.: On computable numbers, with an application to the Entscheidungsproblem. Proc. London Math. Society 42(3,4), 230–265 (1936)
Uspensky, V.A.: Godel’s Incompleteness Theorem. Mir Publsihers, English translation (1987)
Vetter, J.S. (ed.): Contemporary High Performance Computing: From Petascale toward Exascale. Chapman and Hall (2013)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Rao, N.S.V. (2014). On Undecidability Aspects of Resilient Computations and Implications to Exascale. In: Lopes, L., et al. Euro-Par 2014: Parallel Processing Workshops. Euro-Par 2014. Lecture Notes in Computer Science, vol 8805. Springer, Cham. https://doi.org/10.1007/978-3-319-14325-5_44
Download citation
DOI: https://doi.org/10.1007/978-3-319-14325-5_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14324-8
Online ISBN: 978-3-319-14325-5
eBook Packages: Computer ScienceComputer Science (R0)