1 Introduction

Cloud computing is the on-demand delivery of computing services such as servers, storage, databases, networking, software, analytics, and intelligence over the Internet, offering faster innovation, flexible resources, and economies of scale [1]. The sustained growth of cloud services has led to substantial growth in the energy consumption (EC) of cloud data centers. This growing energy demand leads to high operating costs and higher carbon emissions [2]. As a consequence, data centers become unsustainable [3]. The information technology (IT) infrastructure, which includes servers and other IT components, is the main contributor to energy consumption in a data center. Optimizing the energy consumed by cloud infrastructure has therefore been a major research focus in recent years [4].

Most business and complex scientific applications use workflows to analyze complex data sets and to conduct simulation experiments effectively [5]. A scientific workflow comprises individual data transformation and analysis steps, together with mechanisms to link them according to the data dependencies among them [6]. Scientific applications in many domains, including chemistry, physics, earth science, astronomy, computer science, and bioinformatics, comprise numerous tasks with precedence constraints. Scientific workflow applications are highly complex, contain tasks of various sizes, and are modeled as Directed Acyclic Graphs (DAGs). Scientific workflow management therefore mostly deals with large volumes of data. Such complex workflow applications demand high computing power and high system reliability. With the rapid proliferation of scientific applications, it is necessary to focus on developing energy-aware workflow scheduling models for the cloud environment.

The scheduling of workflow tasks on distributed platforms such as grids and cloud environments has been studied extensively for many years. Researchers have developed algorithms geared towards different environments: from small-scale homogeneous clusters to large-scale community grids, to the contemporary paradigm of heterogeneous, utility-based, and resource-rich cloud computing [5]. Clouds provide virtually unlimited computing power, which makes them a highly relevant platform for executing complex workflow applications. However, the computational systems in cloud data centers are not failure-free, and any type of fault may be critical to the running application [7]. In particular, the rate of transient faults increases when the Dynamic Voltage and Frequency Scaling (DVFS) approach is applied. Such failures may affect the execution of the program and provoke unpredictable results [8]. In this work, we propose an efficient heuristic for energy-efficient and reliability-aware workflow task scheduling in the cloud environment (EERS), which maximizes the reliability of workflow tasks while conserving energy.

The primary contributions of this article are summarized as follows:

  • An energy and reliability aware workflow scheduling algorithm, called EERS is proposed to reduce the application energy consumption while maximizing the system reliability.

  • The proposed EERS approach consists of five sub-algorithms: task rank calculation, task clustering, sub-target time distribution, cluster-VM mapping, and a slack algorithm, which together meet the stated objectives.

  • In addition to the energy- and reliability-aware resource management of our approach, the task clustering algorithm reduces communication cost and the slack algorithm saves a significant amount of energy.

  • The performance of the proposed EERS approach was evaluated through several numerical experiments. The experimental results show that the proposed approach achieves significant energy savings while maximizing application reliability, regardless of the workflow structure.

The rest of the article is organized as follows. Related work is discussed in Sect. 2. We propose the system architecture and models in Sect. 3. Section 4 then presents the five sub-algorithms and the implementation of the EERS algorithm. Section 5 gives the performance evaluation of the algorithm, and finally the conclusion and future work are discussed in Sect. 6.

2 Related Works

In recent years, significant attention has been devoted to developing models for workflow scheduling and energy-efficient resource management in the cloud environment. Optimal workflow scheduling and resource management depend on various factors such as the arrival rate of requests, the availability of resources, and the workload of workflow tasks.

Some task scheduling mechanisms for the cloud environment address optimization parameters such as schedule length and cost. A fuzzy dominance-based HEFT algorithm addresses the cost and makespan of workflow applications in clouds [9], using real-world pricing and resource models. A cost-aware large-scale workflow scheduling approach in [10] with a DAG splitting mechanism reduces the monetary cost of an application by maximizing VM utilization. A list-based heuristic called Heterogeneous Earliest Finish Time (HEFT) [11] gives the best makespan. The Jaya heuristic in [12] reduces both computation and communication costs while scheduling workflow tasks in the cloud environment. A Case Library and Pareto Solution-based hybrid Genetic Algorithm (CLPS-GA) [13] focuses on two objectives, makespan and energy conservation, while scheduling workflow tasks. It relies on a case library and a multi-parent crossover operator to ensure the stability, diversity, and convergence of the solution. The approach in [14] employs an immune genetic algorithm to satisfy QoS constraints by considering five objectives for workflow scheduling in the cloud environment. The deadline-based workflow scheduling algorithm in [15] employs the Particle Swarm Optimization (PSO) technique to reduce the overall execution cost of scientific workflow applications in Infrastructure-as-a-Service (IaaS) clouds without violating deadline constraints. However, none of these works considers energy or reliability objectives.

Many efficient scheduling techniques to maximize system reliability have been studied [16,17,18]. Some task scheduling approaches developed for on-demand computing environments address system reliability and performance. A reliability model and a load-balanced scheduling approach are introduced in [19] using Colored Petri Nets (CPN). A task scheduling model is developed in [20] using Ant Colony Optimization (ACO), which first estimates system availability so that the most reliable nodes are selected for task transactions. In [21], an efficient task allocation approach is developed using Social Spider Optimization (SSO) to maximize system performance and reliability. A Honey Bee Optimization (HBO)-based technique is proposed in [22] to solve load-balanced transaction scheduling. The above works ignore the objective of energy conservation.

A green energy-efficient scheduling approach [23] employs the DVFS technique, which enables computing processors to run tasks at low voltages and frequencies. It effectively utilizes cloud data center resources through task-to-VM allocation based on the task dependencies of an application, reducing energy consumption and makespan. An energy-efficient heuristic is proposed in [24] to schedule real-time workflow applications in clouds; it saves energy by exploiting schedule gaps using per-core DVFS and approximate computations. Similar work in [25] addresses energy, monetary cost, and quality-of-service objectives. A polynomial-time multi-objective heuristic is proposed in [26] to schedule time-constrained tasks in the cloud environment; it optimizes energy, cost, and resource utilization while maintaining Service Level Agreements (SLAs). A Stackelberg-game-based Game-Score simulator is developed in [27] to schedule tasks in the cloud environment, trading off the energy consumption and schedule length of a workflow. A load-balancing approach for scheduling tasks in a distributed cloud environment [28] optimizes resource utilization by estimating task execution time based on the cloud status, and uses a queuing model to improve response time. A PSO-based multi-objective approach for workflow scheduling in clouds optimizes cost, schedule length, and resource utilization while considering system reliability [29]. This review of existing state-of-the-art scheduling strategies shows that most of them focus on multi-objective optimization without considering system reliability.

3 System Architecture and Models

Our problem consists of scheduling workflow tasks subject to a user-specified deadline. Our approach aims to minimize energy consumption and maximize system reliability. To support such a technique, we discuss our system architecture and models in this section: the cloud datacenter model, the application model, the energy model, the cloud system architecture, and the system reliability model. For ease of reference, Table 1 outlines the primary notations used throughout this work, together with their meanings.

Table 1 Primary notations

3.1 Datacenter Model

The cloud datacenter model considered in this work comprises M heterogeneous physical machines (PMs) \(PM =\{pm_{1}, pm_{2}, \ldots , pm_{M}\}\). All PMs support DVFS; each can operate at k frequency (f) levels \((f_{1}, f_{2}, \ldots , f_{k})\), and the time to switch between frequencies, approximately 10–150 \(\mu s\), is negligible [30]. The DVFS approach operates processor frequencies under various voltage levels. Every PM is described by different types of resources, such as CPUs, memory capacity, bandwidth, network I/O, and storage size. These physical machines are virtualized into v virtual machines (VMs), each of which can operate at a different frequency level (DVFS enabled); a PM's operating frequencies are attributed to its VMs. A VM is characterized by its maximum computing performance in Million Instructions Per Second (MIPS), bandwidth (B), etc. Hence, the cloud workflow scheduler has distinct possibilities in selecting a suitable VM to execute a task while meeting workflow constraints. Generally, we consider the computing performance to be related to the processor frequency. The operating frequency level of the kth virtual machine \(vm_k\) is represented as \(f_{op}^{k}\).
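As a minimal sketch of the DVFS-enabled VM abstraction described above, the following class models a VM with discrete frequency levels; the field names, MIPS value, and frequency levels are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class VM:
    mips: float                                  # peak computing performance
    bandwidth: float                             # link bandwidth B (MB/s)
    freq_levels: tuple = (0.4, 0.6, 0.8, 1.0)    # normalized DVFS levels (assumed)
    f_op: float = 1.0                            # current operating frequency f_op^k

    def scale_to(self, f):
        """Snap the operating frequency to the nearest allowed DVFS level."""
        self.f_op = min(self.freq_levels, key=lambda lv: abs(lv - f))

vm = VM(mips=2000.0, bandwidth=100.0)
vm.scale_to(0.7)
print(vm.f_op)   # 0.6
```

In a real DVFS-capable processor the scheduler can only choose among a fixed set of P-states, which is why the sketch snaps a requested frequency to the nearest available level.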

3.2 Application Model

In general, a workflow W with dependencies among tasks \(T_w =\{t_1, t_2, \ldots ,t_n\}\) is modeled as a DAG \(W = (T_w, C)\), where C is the set of edges or directed arcs representing the dependencies among tasks. An edge \(c_{ij}\) is the dependency from \(t_i\) to \(t_j\), where \(t_i\) is one of the parents of \(t_j\) and \(t_j\) is one of the children of \(t_i\). A task without parents or predecessors is called an entry task \(t_{entry}\), and a task without children or successors is called an exit task \(t_{exit}\). A task is ready to execute when all of its required inputs are available. When a task \(t_i\) completes its execution, its generated output is transferred to its children. The data transferred (in MB) from task \(t_i\) to its child \(t_j\) is represented by the weight \(w(c_{ij})\) on the edge \(c_{ij}\). We denote the overall execution time of the workflow (makespan) as \(T_M\) and the associated workflow deadline as \(T_D\). For any workflow execution, the deadline \(T_D\) is a user-defined time constraint.

The data communication time \(T(t_{ij})\) between two precedence-constrained tasks is calculated as in Eq. (1)

$$\begin{aligned} T(t_{ij})=\frac{w(c_{ij})}{B} \end{aligned}$$
(1)

where B is the bandwidth and \(w(c_{ij})\) is the data (in MB) communicated between tasks \(t_i\) and \(t_j\). The execution time of any task \(t_i\) is calculated as in Eq. (2)

$$\begin{aligned} T(t_i, vm_k)=\frac{w_i}{f_{max}^k}+T(t_{ij}) \end{aligned}$$
(2)

where \(T(t_i, vm_k)\) is the execution time of task \(t_i\) on \(vm_k\), which includes the effective execution time and the data communication time. The mean execution time of task \(t_i\) is the mean of its execution times on the available VMs, calculated as follows

$$\begin{aligned} {\overline{T}}(t_i)=\sum _{k=1}^n\frac{T(t_i, vm_k)}{n} \end{aligned}$$
(3)

where n is the number of VMs. The earliest start time (EST) and the earliest finish time (EFT) of task \(t_i\) are calculated as follows

$$\begin{aligned} EST(t_i) =\left\{ \begin{array}{ll} 0 &{}\quad \text {if} \; t_i=t_{entry}\\ max_{t_p \in parent(t_i)}EFT(t_p) &{} \quad \text {otherwise} \end{array}\right. \end{aligned}$$
(4)
$$\begin{aligned} EFT(t_i)=EST(t_i)+T(t_i, vm_k) \end{aligned}$$
(5)

The \(EFT(t_{exit})\) is the minimum makespan of W, i.e., \(minT_M=EFT(t_{exit})\). Without loss of generality, we assume that the user-defined deadline \(T_D\) is greater than \(minT_M\), i.e., \(T_D> minT_M\).
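The timing model of Eqs. (1)–(5) can be sketched as follows; the task sizes, bandwidth, and the three-task workflow are illustrative values, not from the paper, and tasks are assumed to be given in topological order.

```python
def comm_time(data_mb, bandwidth):
    """Eq. (1): T(t_ij) = w(c_ij) / B."""
    return data_mb / bandwidth

def exec_time(w_i, f_max, t_comm=0.0):
    """Eq. (2): T(t_i, vm_k) = w_i / f_max^k + T(t_ij)."""
    return w_i / f_max + t_comm

def est_eft(tasks, parents, times):
    """Eqs. (4)-(5): EST is the max EFT over parents; EFT = EST + exec time."""
    est, eft = {}, {}
    for t in tasks:                      # tasks assumed in topological order
        est[t] = max((eft[p] for p in parents.get(t, [])), default=0.0)
        eft[t] = est[t] + times[t]
    return est, eft

# A hypothetical three-task workflow: t1 -> t2, t1 -> t3, t2 -> t3.
tasks = ["t1", "t2", "t3"]
parents = {"t2": ["t1"], "t3": ["t1", "t2"]}
times = {"t1": 2.0, "t2": 3.0, "t3": 1.5}
est, eft = est_eft(tasks, parents, times)
print(eft["t3"])   # minimum makespan minT_M = EFT(t_exit) = 6.5
```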

3.3 Energy Model

The power consumed by the computational servers of cloud data centers is due to the CPU, memory, network interfaces, storage disks, and other underlying circuits. Compared with the other computing resources, the CPU dominates energy consumption. The power consumption of any workflow execution on cloud infrastructure includes frequency-independent, dynamic, and static power consumption: \(P_{ind}\), \(P_{Dynamic}\), and \(P_{Static}\). \(P_{Static}\) is the sleep or static power dissipation (needed to keep the clock circuit running, maintain the basic circuits, etc., and eliminable only by turning off the system); it is calculated as in Eq. (6),

$$\begin{aligned} P_{Static}=V*I \end{aligned}$$
(6)

where I is the current into the device and V is the supply voltage. \(P_{ind}\) is the frequency-independent power consumption, which does not depend on the CPU frequency or the supply voltage. \(P_{ind}\) covers components such as memory, storage disks, I/O, and other network devices, and it can be reduced to a negligible amount by keeping the system in standby mode [31]. The dynamic power dissipation \(P_{Dynamic}\) is the dominant component of energy consumption in the widely used CMOS (Complementary Metal Oxide Semiconductor) circuits. It is the sum of the transient power consumption (switching and through current) and the capacitive-load power consumption [32]. \(P_{Dynamic}\) depends on the frequency and supply voltage of the CPU [33] and can be estimated as in Eq. (7)

$$\begin{aligned} P_{Dynamic}=C_{eff}V^2f = C_{eff}f^3 \end{aligned}$$
(7)

where \(C_{eff}\) is the effective loading capacitance, f is the clock frequency, and V is the supply voltage. The CPU clock frequency f is directly proportional to the supply voltage V; when an application runs at a lower frequency level, the supply voltage reduces linearly. Our scheduling approach minimizes this component while maximizing application reliability. To capture both dynamic and static power dissipation in a system, we adopt the system power model suggested in [31] and afterward used in [30, 34]; the system power consumption P is given by Eq. (8).

$$\begin{aligned} \begin{aligned} P&=P_{Static}+h(P_{ind}+P_{Dynamic})\\&=P_{Static}+h(P_{ind}+C_{eff}V^2f) \end{aligned} \end{aligned}$$
(8)

where h describes the state of the system: 0 specifies that the system is in sleep mode and 1 that it is active. As dynamic power is the most significant component, the other factors are ignored in this paper. Energy is the product of power and time, so the energy consumption while the machine is running is calculated using Eq. (9).

$$\begin{aligned} E=P*t \end{aligned}$$
(9)

The energy consumption of a specific task \(t_i\) with computation cost \(w_i\), executing on \(vm_k\) at operating frequency \(f_{op}^{i,k}\), is calculated as follows

$$\begin{aligned} E(f_{op}^{i, k})=P_{ind}\bigg (\frac{w_{ik}}{f_{op}^{i,k}}\bigg ) + C_{eff}\Big ( f_{op}^{i,k}\Big )^2T(t_i, vm_k) \end{aligned}$$
(10)

and the total energy consumed by the application is the sum of the energies of its individual tasks, calculated using Eq. (11).

$$\begin{aligned} E_{Total}=\sum _{i=1}^nE\Big (f_{op}^{i,k}\Big ) \end{aligned}$$
(11)

For simplicity, we considered only dynamic energy consumption in this work.
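The per-task and total energy model of Eqs. (10)–(11) can be sketched as follows; the values of \(C_{eff}\), \(P_{ind}\), and the task parameters are illustrative assumptions, not taken from the paper.

```python
C_EFF = 1.0     # effective loading capacitance (assumed)
P_IND = 0.1     # frequency-independent power (assumed)

def task_energy(w_ik, f_op, exec_time):
    """Eq. (10): E = P_ind * (w_ik / f_op) + C_eff * f_op^2 * T(t_i, vm_k)."""
    return P_IND * (w_ik / f_op) + C_EFF * f_op ** 2 * exec_time

def total_energy(schedule):
    """Eq. (11): total energy is the sum of per-task energies."""
    return sum(task_energy(w, f, t) for (w, f, t) in schedule)

# Two hypothetical tasks as (computation cost w_ik, frequency f_op, exec time):
schedule = [(10.0, 1.0, 10.0), (6.0, 0.8, 7.5)]
print(total_energy(schedule))
```

Note how the dynamic term scales with the square of the operating frequency, which is why running at a lower frequency (for a longer time) can still reduce total energy.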

3.4 System Architecture

This paper aims to maximize system reliability and reduce the energy consumption of cloud data centers. The works in [35,36,37] introduced cloud system architectures for executing workflow applications in an energy-efficient way, and a reference architecture for a Cloud Workflow Management System (CWfMS) in [5] enables the establishment and execution of workflows. Based on these works, we introduce a system reference architecture for an Energy-efficient Cloud Workflow Management System (eCWfMS), depicted in Fig. 1. The components depicted in Fig. 1 are standard in most CWfMSs.

Fig. 1
figure 1

Reference cloud system architecture

User Interface enables the users to set up, modify, submit, and track their applications.

Workflow Engine is the essential component of the eCWfMS and is responsible for the execution of the workflow application. For this, it has various sub-components: (a) the workflow parser converts high-level workflow descriptions such as XML into internal specifications (objects, tasks, parameters, and dependencies) that are accessed by the scheduler component; (b) the cloud scheduler works with (c) the resource provisioning module to plan the execution of the actual workflow algorithm. The overall performance of the system, its reliability, and its energy consumption depend on the efficiency of the scheduler. For energy-efficient and reliability-aware workflow execution, it needs to interact with components such as:

Service analyzer analyzes the service requirements of an application; the reliability and energy designer minimizes the execution cost while meeting the QoS requirements.

Service scheduler allocates requests to computing resources such as leased VMs and decides when to start or stop VM services. It often interacts with the VM Manager to update the status of VMs and to provision new VMs across the physical machines of the cloud infrastructure.

Administration and monitoring tools include monitoring modules that continuously and dynamically track the status and performance of workflow tasks, and enable the management of leased resources such as VMs. The collected data can be stored in historical databases, which are useful for predicting system performance.

Cloud Information Service (CIS) provides information about different cloud service providers and their resource types, such as VM types and pricing models. For example, Amazon EC2 offers different kinds of instances (virtual servers) in various instance families, such as small, medium, large, and xlarge (extra-large); the computing capabilities, such as CPU power, memory, and storage, depend on the instance type.

Cloud Service Provisioning API is an application programming interface that acts as a gateway to direct or indirect cloud services for users. In particular, for the scheduling problems considered in this work, cloud APIs enable on-demand provisioning and removal of VMs, and monitoring of the provisioned VMs' resources, storage, security, and network configuration.

3.5 System Reliability

During the execution of an application, faults may occur due to hardware breakdown, software failures, cosmic-ray radiation, etc. As transient faults are significantly more frequent than permanent and intermittent faults [30, 33], we focus on transient faults in this paper. Clouds provide sharable heterogeneous computing infrastructure through the Internet. Failures of computing resources (in particular processors) are inevitable and have conflicting effects on application performance and energy consumption. These processor failures are discrete events and are assumed to follow a Poisson process [38]. The operating frequency of the CPU influences the fault arrival rate \(\lambda\) [30]. This fault arrival rate \(\lambda\) affects the performance of a node running a computation-intensive application, and hence the reliability of such a node is essential.

We assume that transient faults happen while individual tasks are executing. With DVFS, the system's operating frequency and its corresponding supply voltage influence the fault arrival rate. Therefore, the fault rate is represented, as in [33], by Eq. (12)

$$\begin{aligned} \lambda (f_{op}^{k})=\lambda _0.g(f_{op}^{k}) \end{aligned}$$
(12)

where \(\lambda _0\) is the initial fault arrival rate at \(f_{max}^k\), \(f_{op}^k\) is the operating frequency of \(vm_k\), and \(g(f_{op}^{k})\) is a decreasing function with \(g(f_{max}^{k})=1\). Equation (12) is generally understood as an exponential relation between \(\lambda\) and the circuit's critical charge. In our scheduling approach and experimental analysis, we assume the model proposed in [33] and afterward used in [30], expressed as the exponential model in Eq. (13)

$$\begin{aligned} \lambda (f_{op}^k)=\lambda _0g(f_{op}^k)=\lambda _0.10^\frac{d(1-f_{op}^k)}{1-f_{min}^k} \end{aligned}$$
(13)

where \(\lambda _0\), \(f_{op}^k\), and \(g(f_{op}^k)\) are as defined before. The positive constant d represents the sensitivity of the fault rate to frequency scaling and the corresponding voltage. It is easy to see that \(\lambda\) increases exponentially under frequency scaling for energy saving, i.e., \(\lambda\) is maximal at the minimum allowed CPU frequency:

$$\begin{aligned} \lambda _{max}=\lambda _0.10^d, \quad \text {for} \quad f_{op}^k=f_{min}^k \end{aligned}$$
(14)

Considering the transient fault model in [30, 33], which follows a Poisson distribution, the reliability R of a task \(t_i\) running on \(vm_k\) is calculated as follows

$$\begin{aligned} R(f_{op}^{i,k})=e^{-\lambda (f_{op}^k).\frac{T(t_i, vm_k)}{f_{op}^k}} \end{aligned}$$
(15)

where \(f_{op}^{i,k}\) is the operating frequency of the node where task \(t_i\) runs. The reliability of an application W with n tasks is the product of the individual task reliabilities

$$\begin{aligned} R_W=\prod _{i=1}^nR(f_{op}^{i,k}) \end{aligned}$$
(16)
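The fault-rate and reliability model of Eqs. (13), (15), and (16) can be sketched as follows; the values of \(\lambda_0\), d, and \(f_{min}\) are illustrative assumptions (\(\lambda_0\) is chosen within the failure-rate range quoted later from [30]).

```python
import math

LAMBDA_0 = 1e-6   # fault rate at f_max (assumed)
D = 3.0           # fault-rate sensitivity to frequency scaling (assumed)
F_MIN = 0.4       # minimum normalized frequency (assumed)

def fault_rate(f_op):
    """Eq. (13): lambda(f) = lambda_0 * 10^(d * (1 - f) / (1 - f_min))."""
    return LAMBDA_0 * 10 ** (D * (1 - f_op) / (1 - F_MIN))

def task_reliability(f_op, exec_time):
    """Eq. (15): R = exp(-lambda(f) * T(t_i, vm_k) / f)."""
    return math.exp(-fault_rate(f_op) * exec_time / f_op)

def workflow_reliability(schedule):
    """Eq. (16): product of per-task reliabilities."""
    r = 1.0
    for f_op, t in schedule:
        r *= task_reliability(f_op, t)
    return r

# Scaling a task down from f_max lengthens it and raises its fault rate,
# so its reliability drops:
print(task_reliability(1.0, 100.0) > task_reliability(0.4, 250.0))  # True
```

The final comparison illustrates the tension the paper addresses: aggressive frequency scaling saves dynamic energy but increases \(\lambda\) exponentially, degrading reliability.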

4 Proposed Algorithm

In this section, we discuss our proposed approach for workflow scheduling in a cloud environment. The proposed EERS approach minimizes energy consumption and maximizes system reliability while meeting the user-defined deadline. It includes five sub-algorithms, namely the task rank calculation algorithm, the task clustering algorithm, the sub-target time distribution algorithm, the cluster-VM mapping algorithm, and the slack algorithm, which together reduce energy consumption and maximize system reliability. This section presents the implementation of these sub-algorithms precisely.

4.1 Task Rank Calculation Algorithm

The rank order of the workflow tasks is established in this stage to meet the requirements of task scheduling. Task ranks are computed so as to respect the precedence constraints and yield a topological order for scheduling. To prioritize the tasks in W without disturbing dependencies, each task \(t_i\) is assigned a rank \(rank(t_i)\), computed recursively starting from the exit task \(t_{exit}\) [30, 34] as follows

Step-1 The exit task's rank is initialized to its average computation time

$$\begin{aligned} rank(t_{exit})={\overline{T}}(t_{exit}) \end{aligned}$$
(17)

Step-2 For each task, compute the rank recursively according to the following expression

$$\begin{aligned} rank(t_{i})={\overline{T}}(t_i) + max_{t_j \in child(t_i)}rank(t_j) \end{aligned}$$
(18)

where \({\overline{T}}(t_i)\) is the average computation time of \(t_i\) over the different VMs. The ranks of all tasks are obtained by repeating the above steps for each task in the workflow.
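The upward-rank recursion of Eqs. (17)–(18) can be sketched as follows; the average execution times and the four-task DAG are hypothetical, and `children` maps each task to its successors.

```python
from functools import lru_cache

# Hypothetical DAG: t1 -> {t2, t3}, t2 -> t4, t3 -> t4.
mean_time = {"t1": 2.0, "t2": 3.0, "t3": 1.0, "t4": 2.5}
children = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"], "t4": []}

@lru_cache(maxsize=None)
def rank(t):
    succ = children[t]
    if not succ:                       # Eq. (17): rank(t_exit) = mean time
        return mean_time[t]
    # Eq. (18): rank(t_i) = mean time + max rank over children
    return mean_time[t] + max(rank(c) for c in succ)

# Scheduling order: decreasing rank respects the precedence constraints.
order = sorted(children, key=rank, reverse=True)
print(order)   # ['t1', 't2', 't3', 't4']
```

Sorting by decreasing rank always places a parent before its children, since each parent's rank strictly exceeds the ranks of all its descendants.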

4.2 Task Clustering Algorithm

Once a parent task completes execution, its generated output is transferred to its child tasks. If the two are scheduled on different VMs, communication energy is consumed; a large amount of energy is spent on inter-processor communication. We can avoid this cost by grouping such tasks and scheduling them on the same machine. Consider three tasks \(t_i\), \(t_{i+1}\), and \(t_{i+2}\) with the dependencies shown in Fig. 2: task \(t_{i+2}\) has two parents, \(t_i\) and \(t_{i+1}\), with communication costs 3 and 5, respectively. Communication energy can be saved by grouping \(t_{i+1}\) and \(t_{i+2}\) and scheduling them on the same VM.

Fig. 2
figure 2

Clustering of tasks

We adopt the clustering approach of [30] in this paper and extend it to minimize communication energy as follows:

Step-1 Starting from the entry task \(t_i=t_{entry}\): for every task \(t_i\), if \(t_i\) is not yet assigned to any cluster, create a new cluster \(cl_l\)

Step-1a add \(t_i\) to \(cl_l\) and sort the children of \(t_i\)

Step-2 For each child \(t_j\) of \(t_i\), if \(t_j\) has not been assigned to a cluster and its parents are assigned to clusters

Step-2a if task \(t_j\) has more than one parent, then find the parent that transfers the most data to it, as follows

$$\begin{aligned} t_k= max_{t_k \in parent(t_j)}w_{kj}, \end{aligned}$$
(19)

\(\quad \text {if}\quad EFT(t_j) \le EST(t_i) \quad \text {then}\quad t_i \leftarrow t_k\) ,\(\quad \text {goto}\quad \text {Step2}\)

Step-2b \(t_i \leftarrow t_j\)

Repeat Step 2 for the entire graph W, until each task is assigned to some cluster.
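A simplified sketch of the clustering idea in Steps 1–2 follows: each task joins the cluster of the parent that transfers the most data to it (Eq. (19)), so the heaviest edges incur no inter-VM communication. The edge weights and tasks are hypothetical, and the full algorithm additionally checks the EST/EFT condition of Step-2a, which this sketch omits.

```python
# (parent, child) -> data transferred in MB (hypothetical values, cf. Fig. 2)
edges = {("t1", "t3"): 3.0, ("t2", "t3"): 5.0, ("t3", "t4"): 2.0}
tasks = ["t1", "t2", "t3", "t4"]        # assumed topological order

cluster = {}
for t in tasks:
    preds = [(p, w) for (p, c), w in edges.items() if c == t]
    if not preds:
        cluster[t] = t                  # an entry task starts its own cluster
    else:
        # Eq. (19): pick the parent with the maximum data transfer
        best, _ = max(preds, key=lambda pw: pw[1])
        cluster[t] = cluster[best]

print(cluster)   # t3 joins t2's cluster (5 MB > 3 MB), and t4 follows t3
```

Scheduling each cluster on one VM then zeroes out the communication time, and hence the communication energy, along the heaviest dependency edges.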

4.3 Sub-target Time Distribution Algorithm

The target time to complete each task, based on \(T_D\), is distributed to each task of the workflow. This proportionately increases the effective execution time (makespan) of the individual tasks, and hence of the application. The makespan extension is achieved by reducing the frequency of the processor on which a task runs, thereby saving energy. To implement sub-target time distribution, we suggest a simple algorithm that runs in polynomial time; its steps are as follows

Step-1 For each task \(t_i\) in a given workflow, calculate the sub-target time using Eq. (20). The sub-target time is the deadline for completing the execution of task \(t_i\).

$$\begin{aligned} T_{sub-target}(t_i)=T(t_i).\frac{T_D}{minT_M} \end{aligned}$$
(20)

where \(T(t_i)\) is the effective execution time of task \(t_i\), and \(T_D\) and \(minT_M\) are the deadline and the minimum makespan of the application, respectively.

Step-2 For each task \(t_i\) update EST and EFT using Eqs. (4) and (5), respectively.

Every task is assigned a new deadline, or target completion time, by repeating the above steps for each task in the workflow.
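The distribution of Eq. (20) can be sketched as follows: each task's execution time is stretched by the ratio \(T_D/minT_M\), so the whole workflow still finishes by the deadline. The execution times, deadline, and makespan below are illustrative.

```python
def sub_target_times(exec_times, deadline, min_makespan):
    """Eq. (20): T_sub-target(t_i) = T(t_i) * T_D / minT_M."""
    scale = deadline / min_makespan
    return {t: et * scale for t, et in exec_times.items()}

exec_times = {"t1": 2.0, "t2": 3.0, "t3": 1.0}      # hypothetical tasks
targets = sub_target_times(exec_times, deadline=9.0, min_makespan=6.0)
print(targets["t2"])   # 3.0 * 9/6 = 4.5
```

Because every task is stretched by the same factor, the sum of sub-targets along the critical path equals exactly \(T_D\), which is the slack the DVFS step later exploits.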

4.4 Cluster-VM Mapping Algorithm

Unlike cluster and utility grid computing environments, in a cloud environment, as long as a service is available, the cloud scheduling approach can provide a time slot to map a task to that service [37]. The cloud has virtually inexhaustible resources and can offer VMs with different characteristics, but this alone is not always sufficient to meet several optimization constraints. Therefore, in this section, we propose cluster-VM mapping to execute cluster tasks so as to maximize system reliability and minimize energy consumption. The steps to select the most appropriate VM for a cluster to execute its tasks are as follows.

Step-1 For each task \(t_i\) (where \(t_i\in cl_l\)), calculate the optimal frequency for energy conservation on each available VM using Eq. (10).

Step-2 Calculate reliability of each task \(t_i\) (where \(t_i\in cl_l\)), on each VM using Eq. (15)

Step-3 Map the tasks of cluster \(cl_l\) to the most reliable VM \(vm_{k}^{opt}\) for execution; we denote this as \(cl_l \rightarrow vm_{k}^{opt}\).

Step-4 If \(vm_{k}^{opt}\) is idle or in communication mode, scale down its operating frequency to the minimum, i.e. \(f_{op}^{i,k}=f_{min}^{i,k}\)

By repeating the above steps for all clusters, each cluster is mapped to the most suitable VM for energy conservation and maximal task reliability.
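A sketch of Steps 1–3 follows: for a cluster's tasks, the VM whose frequency maximizes the product of per-task reliabilities (Eq. (15)) is selected. The fault-rate parameters, task sizes, and VM frequencies are illustrative assumptions, and execution time is taken simply as \(w/f\).

```python
import math

LAMBDA_0, D, F_MIN = 1e-6, 3.0, 0.4   # assumed fault-model parameters

def reliability(f_op, exec_time):
    lam = LAMBDA_0 * 10 ** (D * (1 - f_op) / (1 - F_MIN))   # Eq. (13)
    return math.exp(-lam * exec_time / f_op)                # Eq. (15)

def map_cluster(task_sizes, vm_freqs):
    """Return the index of the VM with the highest cluster reliability."""
    best_vm, best_r = None, -1.0
    for k, f in enumerate(vm_freqs):
        r = 1.0
        for w in task_sizes:
            r *= reliability(f, w / f)     # exec time assumed to be w / f
        if r > best_r:
            best_vm, best_r = k, r
    return best_vm

# Two hypothetical cluster tasks, three candidate VM frequencies:
print(map_cluster([10.0, 6.0], [0.5, 0.8, 1.0]))   # the fastest VM wins here
```

With this fault model the fastest VM always maximizes reliability; the energy-oriented frequency choice of Step-1 and the idle-time scaling of Step-4 pull in the opposite direction, which is exactly the trade-off EERS manages.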

4.5 Slack Algorithm

Electrical energy consumption has become one of the primary concerns of cloud data centers. In the context of workflow application scheduling, there are idle time slots associated with VMs (slack time) while executing non-critical tasks. We can reclaim the slack associated with non-critical tasks by scaling the supply voltage and frequency of the task to conserve energy [37, 39]. Before introducing the slacking algorithm, we first estimate the latest start time (LST) of each task. The LST of task \(t_i\), \(LST(t_i)\), is specified as follows

$$\begin{aligned} LST(t_i) =\left\{ \begin{array}{ll} T_D-T(t_{exit}), &{}\quad \text {for} \;t_i =t_{exit},\\ min_{t_p \in child(t_i)}(LST(t_p)-T(t_p)), &{} \quad \text {otherwise} \end{array}\right. \end{aligned}$$
(21)

The slack time of task \(t_i\) is computed by

$$\begin{aligned} T_{slack}(t_i)=LST(t_i)-EST(t_i) \end{aligned}$$
(22)

A critical task \(t_i\) has no slack to reclaim, i.e. \(T_{slack}(t_i) =0\), whereas for non-critical tasks \(T_{slack}(t_i)>0\). To reduce the frequency \(f_{op}^{i,k}\) of a non-critical task \(t_i\) and conserve energy, we use the following four steps:

Step-1 Calculate the slack time of each task \(t_i\) using Eq. (22) and mark all critical tasks as “defined”, as they have no idle slot to reclaim.

Step-2 Select a task \(t_i\) whose frequency is to be changed, which has the longest \(T_{path}\) and whose parents are marked “defined”, where \(T_{path}\) is the sum of the execution times of the tasks on the path from \(t_i\) to the nearest “defined” task.

Step-3 Reduce the task frequency to extend the execution time and save energy as follows

$$\begin{aligned} f_{op}^{i,k}=\Big (f_{max}^{i,k}\Big ).\frac{T_{path}}{T_{path}+T_{slack}(t_i)} \end{aligned}$$
(23)

Step-4 Update execution time of \(t_i\) and mark it “defined”.

Repeat the above steps for the entire DAG W until all tasks are marked “defined”.
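The core of the slack reclamation, Eqs. (22)–(23), can be sketched as follows: a non-critical task's frequency is scaled down in proportion to its slack, stretching its execution to fill the idle slot. The LST, EST, path time, and maximum frequency below are illustrative values.

```python
def slack_time(lst, est):
    """Eq. (22): T_slack(t_i) = LST(t_i) - EST(t_i)."""
    return lst - est

def scaled_frequency(f_max, t_path, t_slack):
    """Eq. (23): f_op = f_max * T_path / (T_path + T_slack)."""
    return f_max * t_path / (t_path + t_slack)

t_slack = slack_time(lst=12.0, est=8.0)   # 4 time units of reclaimable slack
f_op = scaled_frequency(f_max=1.0, t_path=8.0, t_slack=t_slack)
print(f_op)   # 1.0 * 8 / 12 ≈ 0.667
```

A critical task has \(T_{slack}=0\), so Eq. (23) leaves its frequency at \(f_{max}\); only non-critical tasks are slowed down, which is why the deadline is preserved.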

4.6 EERS Algorithm

We propose the EERS algorithm, which enables the cloud scheduler to complete an application at a lower energy cost while maximizing system reliability and meeting the user-defined deadline. The EERS algorithm comprises the five sub-algorithms introduced in the sections above. We schedule a task on a specified VM when all of its predecessors have finished, i.e., a task becomes schedulable when all of its predecessors complete their execution; when the current task completes, its successor tasks become schedulable. We specify a sub-deadline for every task before mapping it to the most appropriate VM. Thus, each task completes its execution within its target time, and the entire application completes within the specified deadline.

Fig. 3
figure 3

The pseudo-code of the EERS algorithm

The pseudo-code of our EERS algorithm is presented in Fig. 3. First, the EERS algorithm calls the task rank calculation algorithm (line 2 in Fig. 3) to obtain a reasonable execution order of the tasks without violating the precedence constraints of the workflow. At line 3, the task clustering algorithm is called to minimize the communication cost, which reduces energy consumption. The sub-target time distribution algorithm is then called at line 4 to conserve energy by decreasing task frequencies while meeting the user-defined quality parameter (the deadline). The cluster-VM mapping algorithm, which selects the optimal VM to conserve energy and maximize task reliability, is called at line 6, and finally the slack algorithm is called at line 7 to reclaim the slack of non-critical tasks and further reduce energy consumption. The clustering, cluster-VM mapping, and slacking algorithms reduce energy consumption without compromising performance.

5 Performance Evaluations

This section discusses the performance of the EERS workflow scheduling technique. We evaluated the proposed algorithm by conducting a series of simulation runs. We assume that the cloud data center has DVFS-enabled virtual machines (VMs), that every VM has its own computing resources, and that the bandwidth between VM instances is constant. The computing performance of the VMs and the other simulation parameters are presented in Table 2 and are self-defined. The failure rates range from \(10^{-5}\) to \(10^{-7}\) failures/s, as in [30]. For simplicity, we suppose that the frequency of a VM is directly proportional to its computing performance, which is realistic from an experimental point of view. Moreover, every VM operates at several frequency levels between minimum and maximum thresholds, and DVFS exploits these levels to scale a task's operational frequency and conserve energy.
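Because a VM offers only discrete frequency levels between its thresholds, a computed operational frequency must be rounded up to an available level; a small sketch under that assumption (the level set below is made up, not from Table 2):

```python
def pick_dvfs_level(f_op: float, levels: list[float]) -> float:
    """Choose the lowest discrete DVFS level that still meets the
    required operational frequency; fall back to f_max if none does."""
    feasible = [f for f in levels if f >= f_op]
    return min(feasible) if feasible else max(levels)

# A VM with levels {1.0, 1.4, 1.8, 2.2, 2.6} GHz and a required 1.5 GHz
level = pick_dvfs_level(1.5, [1.0, 1.4, 1.8, 2.2, 2.6])  # -> 1.8
```

Rounding up rather than down keeps the task's sub-deadline guarantee intact at the cost of a little reclaimed slack.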

Fig. 4 CyberShake

Fig. 5 Montage

We used the WorkflowSim [40] toolkit to simulate the cloud environment; it permits modeling and simulation of an IaaS cloud. The IaaS cloud provides a virtualized computing environment (VMs) to execute workflow applications. We chose the real-world scientific workflows CyberShake and Montage to evaluate our proposed EERS algorithm. CyberShake is a memory- and data-intensive earthquake science application. It was used in Southern California to study seismic hazards, employing physics-based 3-D ground simulations to study seismic wave propagation effects. Montage is an I/O-intensive astronomy application developed for scientific research; it enables astronomers to assemble sky images in the Flexible Image Transport System (FITS) format into custom mosaics. Scientific workflow applications usually operate on very large datasets. Each workflow has a distinct topological structure; the structures of the CyberShake and Montage workflows are shown in Figs. 4 and 5, respectively. Moreover, each has its own data requirements and computational characteristics [15, 41].

For the experiments, we assumed that each task's workload ranges from 1000 to 3000 MIPS and that the generated output data volume ranges between 100 and 1000 GB. Due to the complexity and massive data volume of the workflow applications, the deadline for a workflow is over 10 h [37].

Table 2 Simulation environment parameters

Our experiments focus mainly on reliability and energy consumption, which vary with the workload (number of tasks in the workflow) and with the number of VMs used to execute it. To demonstrate the strength of the proposed scheduling algorithm in reducing energy consumption while assuring reliability, the simulation results were compared with existing work. We implemented three well-known scheduling approaches, HEFT [11], EES [39], and REEWS [30], for comparison with our algorithm. HEFT is a prominent list-based heuristic for workflow scheduling that optimizes the makespan: at every step, it assigns the highest-priority (rank value) task to the processor that minimizes its earliest finish time (EFT), using an insertion-based policy. The Enhanced Energy-efficient Scheduling (EES) algorithm is a HEFT-based approach that conserves energy while meeting the quality parameters; its fundamental idea is to exploit the slack time of non-critical tasks and allocate it globally to minimize energy. REEWS is a heuristic algorithm that maximizes application reliability and minimizes energy consumption while meeting user-defined quality constraints; it outperformed RHEFT [18] in reliability and PALS [42] in energy saving. REEWS works in four stages: (1) task priority calculation to preserve dependencies; (2) task clustering to minimize communication cost; (3) distribution of the target time (user-defined deadline); and (4) mapping of clusters to VMs with appropriate frequency/voltage levels.
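HEFT's priority step can be sketched as an upward-rank computation; in this simplified version the average execution costs `w` and communication costs `c` are supplied directly rather than averaged over heterogeneous processors, and the task names are illustrative:

```python
def heft_order(tasks, succ, w, c):
    """Order tasks by decreasing upward rank:
    rank(t) = w[t] + max over successors s of (c[(t, s)] + rank(s))."""
    memo = {}
    def rank(t):
        if t not in memo:
            memo[t] = w[t] + max(
                (c[(t, s)] + rank(s) for s in succ.get(t, [])), default=0.0)
        return memo[t]
    return sorted(tasks, key=rank, reverse=True)

# Diamond DAG t1 -> {t2, t3} -> t4: ranks are 10, 7, 5, 2.
order = heft_order(
    ["t1", "t2", "t3", "t4"],
    {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"]},
    {"t1": 2, "t2": 3, "t3": 1, "t4": 2},
    {("t1", "t2"): 1, ("t1", "t3"): 1, ("t2", "t4"): 2, ("t3", "t4"): 2},
)  # -> ["t1", "t2", "t3", "t4"]
```

Full HEFT then walks this list and places each task at the processor slot giving the earliest insertion-based finish time.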

5.1 Performance Evaluation with Different Workloads

First, we evaluated the performance of our algorithm for different workloads, i.e., by varying the number of tasks as 30, 50, 75, 100, 150, and 200 on CyberShake and as 75, 100, 125, 150, 175, and 200 on Montage. The simulation results for energy consumption and system reliability are depicted in Figs. 6 and 7, respectively. On the energy objective, the proposed EERS consistently reduced energy across the workloads. It saved more energy than HEFT, EES, and REEWS because it allocates resources to tasks in a way that reduces energy consumption. Moreover, EERS clusters tasks effectively to reduce communication costs, which saves further energy. Finally, EERS uses a slack reclaiming approach that takes advantage of DVFS to lower a task's supply voltage and frequency and thereby reduce energy use. The REEWS algorithm is efficient in maximizing system reliability and outperformed state-of-the-art techniques such as RHEFT and PALS on the reliability objective. Our proposed EERS achieves better reliability than HEFT and is on par with REEWS, yet in every case it consumed less energy than the other approaches.

Fig. 6 a Energy consumption and b reliability for various workloads on the CyberShake workflow

Fig. 7 a Energy consumption and b reliability for different numbers of VMs on the CyberShake workflow

5.2 Performance Evaluation with Different Number of VMs

Further, we evaluated the performance of the EERS approach for different numbers of VMs on both the CyberShake and Montage workflows. We considered 5, 10, 15, 20, and 25 VMs on CyberShake and 15, 18, 20, 24, and 28 VMs on Montage. The simulation results for energy consumption and system reliability are depicted in Figs. 8 and 9, respectively. The EERS algorithm is efficient in selecting both the number and the type of VMs with respect to energy consumption and reliability; hence, in this case, it saved more energy on both workflows than the other approaches. Our approach also maintained good reliability as the number of processors/VMs varied, because the EERS algorithm considers the failure rate before selecting the resources onto which tasks are mapped.

Fig. 8 a Energy consumption and b reliability for various workloads on the Montage workflow

Fig. 9 a Energy consumption and b reliability for different numbers of VMs on the Montage workflow

6 Conclusion

Recently, the need for energy conservation and for maximizing system reliability has become an important research topic. In this work, we considered scheduling real-world scientific workflows in the cloud environment. We presented the Energy Efficient and Reliability-Aware Scheduler (EERS) for scheduling workflow tasks in cloud computing to minimize energy consumption and maximize application reliability while satisfying user-defined deadline constraints. The proposed EERS approach comprises five sub-algorithms: (1) a task rank calculation algorithm, (2) a task clustering algorithm, (3) a sub-target time distribution algorithm, (4) a cluster-VM mapping algorithm, and (5) a slack algorithm. We discussed the performance of EERS in Sect. 5 through extensive simulation runs on the WorkflowSim toolkit, evaluating the algorithm on the real-world scientific workflows CyberShake and Montage for different numbers of virtual machines and different workloads. We compared EERS with popular workflow scheduling techniques such as HEFT, EES, and REEWS. The simulation results show that the proposed approach consumed less energy while maximizing system reliability in all cases. The time complexity of the proposed sub-algorithms is polynomial. Overall, the simulation outcomes reveal that our EERS approach surpasses the other algorithms in both energy consumption and reliability.

As future work, we will examine the actual electricity costs of workflow task scheduling, and we will further incorporate monetary cost constraints and frequency-independent energy consumption.