Abstract
Higher transistor counts, lower voltage levels, and reduced noise margins increase the susceptibility of multicore processors to transient faults. Redundant hardware modules can detect such faults, but software techniques are more appealing for their low cost and flexibility. Recent software proposals have not achieved widespread acceptance because they increase register pressure, double memory usage, or are too slow in the absence of hardware extensions. This paper presents DAFT, a fast, safe, and memory-efficient transient fault detection framework for commodity multicore systems. DAFT replicates computation across multiple cores and schedules fault detection off the critical path. Where possible, values are speculated to be correct and are communicated to the redundant thread only at essential program points. DAFT is implemented in the LLVM compiler framework and evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a commodity multicore system. Evaluation results demonstrate that speculation allows DAFT to improve the performance of software redundant multithreading by 2.17× with no degradation of fault coverage.
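The general shape of software redundant multithreading with detection off the critical path can be sketched as follows. This is a minimal illustration under stated assumptions, not DAFT's implementation: the workload `checksum_step`, the functions `leading`, `trailing`, and `run_redundant`, and the `fault_at` injection parameter are all hypothetical names introduced here. For simplicity the sketch forwards every intermediate value, whereas the abstract describes communicating only at essential program points. Python threads also do not run on separate cores in parallel, so the sketch shows structure only, not performance.

```python
import queue
import threading


def checksum_step(acc, x):
    # Hypothetical workload that both threads compute redundantly.
    return (acc * 31 + x) % 1_000_003


def leading(data, q, fault_at=None):
    # Leading thread: computes, forwards each value, and never waits
    # for the trailing thread to finish checking it.
    acc = 0
    for i, x in enumerate(data):
        acc = checksum_step(acc, x)
        if i == fault_at:
            acc ^= 1  # simulate a transient single-bit flip (assumption)
        q.put((i, acc))  # one-way communication, leading -> trailing
    q.put(None)  # end-of-stream marker
    return acc


def trailing(data, q, errors):
    # Trailing thread: recomputes the same values and compares them
    # against what the leading thread produced.
    acc = 0
    while True:
        item = q.get()
        if item is None:
            break
        i, lead_value = item
        acc = checksum_step(acc, data[i])
        if acc != lead_value:
            errors.append(i)  # record the first mismatching step
            break


def run_redundant(data, fault_at=None):
    q = queue.Queue()  # unbounded, so the leading thread never blocks
    errors = []
    checker = threading.Thread(target=trailing, args=(data, q, errors))
    checker.start()
    result = leading(data, q, fault_at)
    checker.join()
    return result, errors
```

Calling `run_redundant([1, 2, 3, 4])` returns the result with an empty error list, while passing `fault_at=2` makes the trailing thread report a mismatch at step 2. Because the queue is one-directional, the leading thread's progress does not depend on the checker, which is the property that keeps detection off the critical path.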
Cite this article
Zhang, Y., Lee, J.W., Johnson, N.P. et al. DAFT: Decoupled Acyclic Fault Tolerance. Int J Parallel Prog 40, 118–140 (2012). https://doi.org/10.1007/s10766-011-0183-4
DOI: https://doi.org/10.1007/s10766-011-0183-4