Grid Resource Availability Prediction-Based Scheduling and Task Replication

Rood, Brent; Lewis, Michael J.

doi:10.1007/s10723-009-9135-2

Published: 02 September 2009

Volume 7, article number 479 (2009)

Journal of Grid Computing

Abstract

The frequent and volatile unavailability of volunteer-based Grid computing resources challenges Grid schedulers to make effective job placements. The manner in which host resources become unavailable will have different effects on different jobs, depending on their runtime and their ability to be checkpointed or replicated. A multi-state availability model can help improve scheduling performance by capturing the various ways a resource may be available or unavailable to the Grid. This paper uses a multi-state model and analyzes a machine availability trace in terms of that model. Several prediction techniques then forecast resource transitions into the model’s states. We analyze the accuracy of our predictors, which outperform existing approaches. We also propose and study several classes of schedulers that utilize the predictions, and a method for combining scheduling factors. We characterize the inherent tradeoff between job makespan and the number of evictions due to failure, and demonstrate how our schedulers can navigate this tradeoff under various scenarios. Lastly, we propose job replication techniques, which our schedulers utilize to replicate those jobs that are most likely to fail. Our replication strategies outperform others, as measured by improved makespan and fewer redundant operations. In particular, we define a new metric for replication efficiency, and demonstrate that our multi-state availability predictor can provide information that allows our schedulers to be more efficient than others that blindly replicate all jobs or some static percentage of jobs.
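The contrast the abstract draws between blind replication and prediction-guided replication can be illustrated with a minimal sketch. This is not the authors' scheduler: the function names, the job names, the probabilities, and the 0.5 threshold are all hypothetical, and the predictor output is simply assumed to be a per-job failure probability.

```python
from typing import Dict, List


def replicate_all(failure_prob: Dict[str, float]) -> List[str]:
    """Baseline policy: replicate every job regardless of predicted risk."""
    return sorted(failure_prob)


def replicate_by_prediction(failure_prob: Dict[str, float],
                            threshold: float = 0.5) -> List[str]:
    """Replicate only jobs whose predicted failure probability meets the threshold."""
    return sorted(job for job, p in failure_prob.items() if p >= threshold)


# Hypothetical predictor output: job name -> probability that the job's host
# becomes unavailable before the job finishes (values invented for illustration).
predicted = {"job-a": 0.05, "job-b": 0.80, "job-c": 0.35, "job-d": 0.60}

print(replicate_all(predicted))            # every job gets a replica
print(replicate_by_prediction(predicted))  # only the high-risk jobs do
```

Under these assumed inputs, the prediction-guided policy creates replicas for two of the four jobs, which is the kind of reduction in redundant work the abstract attributes to using availability predictions.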

Author information

Authors and Affiliations

Department of Computer Science, State University of New York at Binghamton, P.O. Box 6000, Binghamton, NY, 13902-6000, USA
Brent Rood & Michael J. Lewis

Corresponding author

Correspondence to Brent Rood.

Additional information

This research is supported by NSF Award CNS-0454298.

About this article

Cite this article

Rood, B., Lewis, M.J. Grid Resource Availability Prediction-Based Scheduling and Task Replication. J Grid Computing 7, 479 (2009). https://doi.org/10.1007/s10723-009-9135-2

Received: 20 February 2009
Accepted: 18 August 2009
Published: 02 September 2009
DOI: https://doi.org/10.1007/s10723-009-9135-2
