Abstract
This paper deals with the reliability of task graph schedules with transient and fail-stop failures. While computing the reliability of a given schedule is easy in the absence of task replication, the problem becomes much more difficult when task replication is used. We fill a complexity gap of the scheduling literature: our main result is that this reliability problem is #P′-Complete (hence at least as hard as NP-Complete problems), both for transient and for fail-stop processor failures. We also study the evaluation of a restricted class of schedules, where a task cannot be scheduled before all replicas of all its predecessors have completed their execution. Although the complexity in this case with fail-stop failures remains open, we provide an algorithm to estimate the reliability while limiting evaluation costs, and we validate this approach through simulations.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Awerbuch, B., Azar, Y., Fiat, A., & Leighton, F. T. (1996). Making commitments in the face of uncertainty: How to pick a winner almost every time. In 28th ACM SToC (pp. 519–530).
Bannister, J., & Trivedi, K. S. (1983). Task allocation in fault-tolerant distributed systems. Acta Informatica, 20, 261–281.
Barlow, R. E., & Proschan, F. (1967). Mathematical theory of reliability. New York: Wiley.
Bhatt, S., Chung, F., Leighton, F., & Rosenberg, A. (1997). On optimal strategies for cycle-stealing in networks of workstations. IEEE Transactions on Computers, 46(5), 545–557.
Bodlaender, H. L., & Wolle, T. (2004). A note on the complexity of network reliability problems. IEEE Transactions on Information Theory, 47, 1971–1988.
Bream, B. (1995) Reliability block diagrams and reliability modeling (Tech. rep.), Office of safety and mission assurance, NASA Lewis Research Center.
Brucker, P. (2004). Scheduling algorithms. Berlin: Springer.
Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., & Snir, M. (2009). Toward exascale resilience. The International Journal of High Performance Computing Applications, 23(4), 374–388.
Dongarra, J., Jeannot, E., Saule, E., & Shi, Z. (2007). Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems. In: 19th ACM symp. on parallelism in algo. and archi. (SPAA’07), San Diego, CA, USA.
Girault, A., & Kalla, H. (2009). A novel bicriteria scheduling heuristic providing a guaranteed global system failure rate. IEEE Transactions on Dependable and Secure Computing, 6(4), 241–254.
Girault, A., Saule, E., & Trystram, D. (2009). Reliability versus performance for critical applications. Journal of Parallel and Distributed Computing, 69(3), 326–336.
Jeannot, E., Saule, E., & Trystram, D. (2008). Bi-objective approximation scheme for makespan and reliability optimization on uniform parallel machines. In: The 14th int. Euro-par conf. on parallel and distributed computing, Spain.
Kartik, S., & Murthy, C. S. R. (1997). Task allocation algorithms for maximizing reliability of distributed computing systems. IEEE Transactions on Computers, 46(6), 719–724.
Provan, J. S., & Ball, M. O. (1983). The complexity of counting cuts and of computing the probability that a graph is connected. SIAM Journal on Computing, 12(4), 777–788.
Rosenberg, A. L. (2002). Optimal schedules for cycle-stealing in a network of workstations with a bag-of-tasks workload. IEEE Transactions on Parallel and Distributed Systems, 13(2), 179–191.
Shatz, S., & Wang, J. (1989). Models and algorithms for reliability-oriented task-allocation in redundant distributed-computer systems. IEEE Transactions on Reliability, 38(1), 16–26.
Shatz, S., Wang, J., & Goto, M. (1992). Task allocation for maximizing reliability of distributed computer systems. IEEE Transactions on Computers, 41(9), 1156–1168.
Valiant, L. G. (1979). The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3), 410–421.
Zhu, D., Melhem, R., & Mossé, D. (2004). The effects of energy management on reliability in real-time embedded systems. In: International conference on computer aided design, ICCAD’04, San Jose (CA), USA, pp. 35–40.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Benoit, A., Canon, LC., Jeannot, E. et al. Reliability of task graph schedules with transient and fail-stop failures: complexity and algorithms. J Sched 15, 615–627 (2012). https://doi.org/10.1007/s10951-011-0236-y
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10951-011-0236-y