Abstract
As the scale and complexity of heterogeneous computing systems grow, failures occur frequently and have an adverse effect on solving large-scale applications. Hence, fault-tolerant scheduling is an imperative step for large-scale computing systems. The existing fault-tolerant scheduling algorithms belong to static scheduling, and they allocate multiple copies of each task to several processors no matter whether processor failures affect the execution of tasks. Such active replication strategies not only waste resource but also sacrifice the makespan. What is more, they cannot guarantee the successful execution of applications. In this paper, we propose a fault-tolerant dynamic rescheduling algorithm named FTDR, which can overcome above drawbacks. FTDR keeps listening to the processor failure, and reschedules the suspended tasks once failures occur. Because FTDR reschedules the tasks that are suspended because of failures, it can tolerate an arbitrary number of failures. Randomly generated DAGs are tested in our experiments. Experimental results show that the proposed algorithm achieves good performance in terms of makespan and resource consumption compared with its direct competitors.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Kasahara, H., Narita, S.: Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE Trans. Comput. 33(11), 1023–1029 (1984)
Topcuoglu, H., Hariri, S., Wu, M.-Y.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 260–274 (2002)
Daoud, M.I., Kharma, N.: A high performance algorithm for static task scheduling in heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 68(4), 399–409 (2008)
Nesmachnow, S., Dorronsoro, B., Pecero, J., Bouvry, P.: Energy-aware scheduling on multicore heterogeneous grid computing systems. J. Grid Comput. 11(4), 653–680 (2013)
Arabnejad, H., Barbosa, J.: A budget constrained scheduling algorithm for workflow applications. J. Grid Comput. 12(4), 665–679 (2014)
Ranaweera, S., Agrawal, D.: A scalable task duplication based scheduling algorithm for heterogeneous systems. In: Proceedings of 2000 International Conference on Parallel Processing, pp. 383–390 (2000)
Bansal, S., Kumar, P., Singh, K.: An improved duplication strategy for scheduling precedence constrained graphs in multiprocessor systems. IEEE Trans. Parallel Distrib. Syst. 14(6), 533–544 (2003)
Shin, K., Cha, M., Jang, M., Jung, J., Yoon, W., Choi, S.: Task scheduling algorithm using minimized duplications in homogeneous systems. J. Parallel Distrib. Comput. 68(8), 1146–1156 (2008)
Tang, X., Li, K., Liao, G., Li, R.: List scheduling with duplication for heterogeneous computing systems. J. Parallel Distrib. Comput. 70(4), 323–329 (2010)
Song, I., Yoon, W., Jang, E., Choi, S.: Task scheduling algorithm with minimal redundant duplications in homogeneous multiprocessor system in Grid and Distributed Computing, pp. 238–245. Springer (2011)
Bansal, S., Kumar, P., Singh, K.: An improved duplication strategy for scheduling precedence constrained graphs in multiprocessor systems. IEEE Trans. Parallel Distrib. Syst. 14(6), 533–544 (2003)
Hagras, T., brevecek, J.J.: A high performance, low complexity algorithm for compile-time task scheduling in heterogeneous systems. Parallel Comput. 31(7), 653–670 (2005)
Liou, J., Palis, M.: An efficient task clustering heuristic for scheduling dags on multiprocessors. In: Proceedings of Parallel and Distributed Processing Symposium (1996)
Fangfa, F., Yuxin, B., Xinaan, H., Jinxiang, W., Minyan, Y., Jia, Z.: An objective-flexible clustering algorithm for task mapping and scheduling on cluster-based noc. In: 2010 10th Russian-Chinese Symposium on Laser Physics and Laser Technologies (RCSLPLT) and 2010 Academic Symposium on Optoelectronics Technology (ASOT), 28 2010-aug. 1 2010, pp. 369–373
Khan, M.A.: Scheduling for heterogeneous systems using constrained critical paths. Parallel Comput. 38(4), 175–193 (2012)
Stearley, J.: Defining and measuring supercomputer reliability, availability, and serviceability (ras). In: Proceedings of the Linux Clusters Institute Conference (2005)
Rahman, R.M., Barker, K., Alhajj, R.: Replica placement strategies in data grid. J. Grid Comput. 6(1), 103–123 (2008)
Yang, H., Luan, Z., Li, W., Qian, D.: Mapreduce workload modeling with statistical approach. J. grid Comput. 10(2), 279–310 (2012)
Koo, R., Toueg, S.: Checkpointing and rollback-recovery for distributed systems. IEEE Trans. Softw. Eng. 1, 23–31 (1987)
Chakravorty, S.: A fault tolerance protocol for fast recovery. ProQuest (2008)
Yang, X., Wang, Z., Xue, J., Zhou, Y.: The reliability wall for exascale supercomputing. IEEE Trans. Comput. 61(6), 767–779 (2012)
Benoit, A., Hakem, M., Robert, Y.: Fault tolerant scheduling of precedence task graphs on heterogeneous platforms. In: IEEE International Symposium Parallel Distributed Processing, pp. 1–8. IEEE (2008)
Zhao, L., Ren, Y., Xiang, Y., Sakurai, K.: Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems. In: 12th IEEE International Conference High Performance Computing Communications, pp. 434–441. IEEE (2010)
Shatz, S.M., Wang, J.-P., Goto, M.: Task allocation for maximizing reliability of distributed computer systems. IEEE Trans. Comput. 41(9), 1156–1168 (1992)
Qin, X., Jiang, H.: A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters. J. Parallel Distrib. Comput. 65(8), 885–900 (2005)
Dongarra, J.J., Jeannot, E., Saule, E., Shi, Z.: Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems. In: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pp. 280–288. ACM (2007)
Jeannot, E., Saule, E., Trystram, D.: Bi-objective approximation scheme for makespan and reliability optimization on uniform parallel machines. In: Euro-Par 2008–Parallel Processing, pp. 877–886. Springer (2008)
Girault, A., Saule, E., Trystram, D.: Reliability versus performance for critical applications. J. Parallel Distrib. Comput. 69(3), 326–336 (2009)
Tang, X., Li, K., Li, R., Veeravalli, B.: Reliability-aware scheduling strategy for heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 70(9), 941–952 (2010)
Boeres, C., Sardiña, I. M., Drummond, L.: An efficient weighted bi-objective scheduling algorithm for heterogeneous systems. Parallel Comput. 37(8), 349–364 (2011)
Jeannot, E., Saule, E., Trystram, D.: Optimizing performance and reliability on heterogeneous parallel systems: Approximation algorithms and heuristics. J. Parallel Distrib. Comput. 72(2), 268–280 (2012)
Tao, Y., Jin, H., Wu, S., Shi, X., Shi, L.: Dependable grid workflow scheduling based on resource availability. J. Grid Comput. 11(1), 47–61 (2013)
Hakem, M., Butelle, F.: Reliability and scheduling on systems subject to failures. In: International Conference on Parallel Processing, pp. 38–38. IEEE (2007)
Qin, X., Jiang, H.: A novel fault-tolerant scheduling algorithm for precedence constrained tasks in real-time heterogeneous systems. Parallel Comput. 32(5), 331–356 (2006)
Zheng, Q., Veeravalli, B.: On the design of communication-aware fault-tolerant scheduling algorithms for precedence constrained tasks in grid computing systems with dedicated communication devices. J. Parallel Distrib. Comput. 69(3), 282–294 (2009)
Zheng, Q., Veeravalli, B., Tham, C.-K.: On the design of fault-tolerant scheduling strategies using primary-backup approach for computational grids with low replication costs. IEEE Trans. Comput. 58(3), 380–393 (2009)
Benoit, A., Hakem, M., Robert, Y.: Realistic models and efficient algorithms for fault tolerant scheduling on heterogeneous platforms. In: 37th International Conference on Parallel Processing, pp. 246–253. IEEE (2008)
Khokhar, A., Prasanna, V., Shaaban, M., Wang, C.-L.: Heterogeneous computing: challenges and opportunities. Computer 26(6), 18–27 (1993)
Radulescu, A., Van Gemund, A.: Fast and effective task scheduling in heterogeneous systems. In: Proceedings of 9th Heterogeneous Computing Workshop, pp. 229–238 (2000)
Choudhury, P., Chakrabarti, P., Kumar, R.: Online scheduling of dynamic task graphs with communication and contention for multiprocessors, vol. 23, pp. 126–133 (2012)
Young, J.W.: A first order approximation to the optimum checkpoint interval. Commun. ACM 17(9), 530–531
Jin, H., Sun, X.-H., Zheng, Z., Lan, Z., Xie, B.: Performance under failures of dag-based parallel computing. In: 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pp. 236–243 (2009)
Daoud, M.I., Kharma, N.: A high performance algorithm for static task scheduling in heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 68(4), 399–409 (2008)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Mei, J., Li, K., Zhou, X. et al. Fault-Tolerant Dynamic Rescheduling for Heterogeneous Computing Systems. J Grid Computing 13, 507–525 (2015). https://doi.org/10.1007/s10723-015-9331-1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10723-015-9331-1