Abstract
The frequent and volatile unavailability of volunteer-based Grid computing resources challenges Grid schedulers to make effective job placements. The manner in which host resources become unavailable will have different effects on different jobs, depending on their runtime and their ability to be checkpointed or replicated. A multi-state availability model can help improve scheduling performance by capturing the various ways a resource may be available or unavailable to the Grid. This paper uses a multi-state model and analyzes a machine availability trace in terms of that model. Several prediction techniques then forecast resource transitions into the model’s states. We analyze the accuracy of our predictors, which outperform existing approaches. We also propose and study several classes of schedulers that utilize the predictions, and a method for combining scheduling factors. We characterize the inherent tradeoff between job makespan and the number of evictions due to failure, and demonstrate how our schedulers can navigate this tradeoff under various scenarios. Lastly, we propose job replication techniques, which our schedulers utilize to replicate those jobs that are most likely to fail. Our replication strategies outperform others, as measured by improved makespan and fewer redundant operations. In particular, we define a new metric for replication efficiency, and demonstrate that our multi-state availability predictor can provide information that allows our schedulers to be more efficient than others that blindly replicate all jobs or some static percentage of jobs.
Additional information
This research is supported by NSF Award CNS-0454298.
Cite this article
Rood, B., Lewis, M.J. Grid Resource Availability Prediction-Based Scheduling and Task Replication. J Grid Computing 7, 479 (2009). https://doi.org/10.1007/s10723-009-9135-2
DOI: https://doi.org/10.1007/s10723-009-9135-2