A hybrid fault tolerance technique in grid computing system

Qureshi, Kalim; Khan, Fiaz Gul; Manuel, Paul; Nazir, Babar

doi:10.1007/s11227-009-0345-y

Published: 19 January 2010

Volume 56, pages 106–128, (2011)

The Journal of Supercomputing

Kalim Qureshi¹, Fiaz Gul Khan², Paul Manuel¹ & Babar Nazir²

Abstract

To achieve a high level of reliability and availability, a grid infrastructure must be fault tolerant. Because resource failures can be fatal to job execution, a fault tolerance service is essential to satisfy QoS requirements in grid computing.

In this paper we propose two hybrid fault tolerance techniques (FTTs), called alternate task with checkpoint and alternate task with retry. These hybrid FTTs inherit the strengths and overcome the limitations of workflow-level and task-level FTTs. We evaluate the performance of the proposed FTTs under different experimental environments and conclude that alternate task with checkpoint improves the reliability of a grid system more significantly than alternate task with retry.
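
The abstract names the two hybrid techniques but does not describe their mechanics. As a rough illustration only, and not the authors' implementation, the Python sketch below shows one plausible way a workflow-level policy (alternate task) could wrap a task-level policy (retry or checkpoint). Every name here (`TaskFailure`, `run_with_retry`, `run_with_checkpoint`, `run_with_alternates`) is hypothetical, faults are simulated with an exception, and the "checkpoint" is just an in-memory step counter standing in for saved state.

```python
class TaskFailure(Exception):
    """Signals a (simulated) fault during task execution."""


def run_with_retry(task, max_retries):
    """Task-level retry: rerun the whole task from the start after each fault."""
    for _ in range(max_retries + 1):
        try:
            return task()
        except TaskFailure:
            pass  # discard partial work and start over
    raise TaskFailure("retries exhausted")


def run_with_checkpoint(step, total_steps, max_restarts):
    """Task-level checkpointing: after a fault, resume at the last saved step."""
    next_step = 0  # hypothetical stand-in for persisted checkpoint state
    restarts = 0
    while next_step < total_steps:
        try:
            step(next_step)
            next_step += 1  # progress is "saved" after every completed step
        except TaskFailure:
            restarts += 1
            if restarts > max_restarts:
                raise TaskFailure("restarts exhausted")
    return next_step


def run_with_alternates(implementations, recover):
    """Workflow-level alternate task wrapped around a task-level recovery policy."""
    for impl in implementations:
        try:
            return recover(impl)
        except TaskFailure:
            continue  # this implementation gave up; fall back to the next one
    raise TaskFailure("all alternate implementations failed")


def always_fails():
    raise TaskFailure("primary implementation is broken")


def backup():
    return "result from alternate task"


# Alternate task with retry: retry each implementation before switching.
result = run_with_alternates(
    [always_fails, backup],
    recover=lambda impl: run_with_retry(impl, max_retries=2),
)
```

For the checkpoint variant, the same wrapper would be called with step functions and `recover=lambda step: run_with_checkpoint(step, total_steps=5, max_restarts=1)`. In this sketch the difference between the two is that retry repeats all work after a fault, while checkpointing repeats only the step that failed.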

Author information

Authors and Affiliations

Information Science Dept., Kuwait University, Kuwait City, Kuwait
Kalim Qureshi & Paul Manuel
COMSATS Institute of Information Technology, Abbottabad, Pakistan
Fiaz Gul Khan & Babar Nazir

Corresponding author

Correspondence to Kalim Qureshi.

About this article

Cite this article

Qureshi, K., Khan, F.G., Manuel, P. et al. A hybrid fault tolerance technique in grid computing system. J Supercomput 56, 106–128 (2011). https://doi.org/10.1007/s11227-009-0345-y

Issue Date: April 2011
DOI: https://doi.org/10.1007/s11227-009-0345-y
