Abstract
To achieve a high level of reliability and availability, a grid infrastructure must be fault tolerant. Fault tolerance plays a key role in ensuring the availability and reliability of a grid system. Because resource failures can be fatal to job execution, a fault tolerance service is essential for meeting QoS requirements in grid computing.
In this paper we propose two hybrid fault tolerance techniques (FTTs), called alternate task with checkpoint and alternate task with retry. The proposed hybrid FTTs inherit the strengths and overcome the limitations of workflow-level and task-level FTTs. We evaluate the performance of the proposed FTTs under different experimental environments. We conclude that alternate task with checkpoint improves the reliability of a grid system more significantly than alternate task with retry.
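The abstract names the two hybrid techniques without describing their mechanics, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "alternate task" means falling back to another implementation of the same task, that "retry" restarts a failed implementation from the beginning, and that "checkpoint" resumes it from its last saved step. All names here (`TaskFailure`, `make_flaky_task`, `max_retries`, the `save` callback) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set


class TaskFailure(Exception):
    """Raised by a task when its resource fails (hypothetical signal)."""


# A task receives the step to start from and a callback that saves progress.
Task = Callable[[int, Callable[[int], None]], str]


def alternate_task_with_retry(tasks: Sequence[Task], max_retries: int) -> str:
    """Retry each implementation from step 0, then fall back to the next."""
    for task in tasks:
        for _ in range(max_retries + 1):
            try:
                return task(0, lambda step: None)  # progress is discarded
            except TaskFailure:
                continue
    raise TaskFailure("all implementations failed")


def alternate_task_with_checkpoint(tasks: Sequence[Task], max_retries: int) -> str:
    """Resume each implementation from its last checkpoint, then fall back."""
    for task in tasks:
        saved: Dict[str, int] = {"step": 0}  # assumed: one checkpoint per implementation

        def save(step: int) -> None:
            saved["step"] = step

        for _ in range(max_retries + 1):
            try:
                return task(saved["step"], save)
            except TaskFailure:
                continue
    raise TaskFailure("all implementations failed")


def make_flaky_task(name: str, total_steps: int, failures: Set[int],
                    log: List[int]) -> Task:
    """Build a toy task that fails once at each step listed in `failures`."""
    pending = set(failures)

    def task(start: int, save: Callable[[int], None]) -> str:
        for step in range(start, total_steps):
            if step in pending:
                pending.discard(step)
                raise TaskFailure(f"{name} failed at step {step}")
            log.append(step)   # record the work actually executed
            save(step + 1)     # checkpoint: the next step to run
        return name

    return task
```

In this toy model, a four-step task that fails once at step 2 re-executes steps 0 and 1 under retry but not under checkpoint, which illustrates the kind of saved work that checkpointing is meant to provide. It says nothing about the paper's measured results.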
Cite this article
Qureshi, K., Khan, F.G., Manuel, P. et al. A hybrid fault tolerance technique in grid computing system. J Supercomput 56, 106–128 (2011). https://doi.org/10.1007/s11227-009-0345-y
DOI: https://doi.org/10.1007/s11227-009-0345-y