1 Introduction

Large-scale distributed computing systems (LDCSs), such as grids and clouds, have emerged as an essential infrastructure for large-scale and resource-intensive applications, particularly in scientific and engineering domains. An important class of applications, referred to as workflows, is characterized by jobs made up of tasks with temporal constraints among them, arising, for example, because a task needs data or control information produced by earlier tasks of the same job. Scientific workflows in bioinformatics [1] and earthquake studies [2], for instance, involve the orchestration of serial short-duration tasks for data collection, processing and transfer. Real-time weather prediction modeling [3] additionally imposes strict deadlines on task completion.

LDCSs allow vast amounts of heterogeneous computing resources, which are pervasive in our environment, to be harnessed and managed cooperatively toward a common objective. They provide a cost-effective way to support a range of computing applications with existing distributed resources. However, the efficiency of these systems has become a serious concern, with utilization as poor as 10 % or even lower in many cases [4].

There have been many studies on workflow scheduling in LDCSs, e.g., [5–11]. The most popular ones are list scheduling heuristics, which first rank tasks into a priority list based on data dependencies, and then select a resource for each of them one after another. A typical example is heterogeneous earliest-finish-time (HEFT) [5], which schedules each task on the resource that can finish it at the earliest time. The approach was designed mainly for tightly coupled heterogeneous computing environments, but it has also been applied to loosely coupled systems, such as ASKALON [12]. Several variants of list heuristics were proposed later, such as group (or level-based) heuristics [6] and critical-path-based heuristics [7, 8]. When future tasks of a job are allocated to run on a resource at a particular time, we say that an advance reservation has been made; this is a crucial resource provisioning mechanism adopted by many scheduling algorithms [13–16], especially when QoS guarantees are required [17]. Nevertheless, these studies mostly deal with a single workflow at a time without paying much attention to performance uncertainties.

In this paper, we address the problem of scheduling multiple, concurrent workflow applications, explicitly taking into account resource efficiency and scheduling robustness. More specifically, we study how to utilize heterogeneous resources efficiently when multiple workflows are scheduled concurrently in the system. This scheduling is much more complicated than traditional single-workflow scheduling, due primarily to the fact that resources are heterogeneous and shared. Also, to the best of our knowledge, there is no existing method to obtain accurate performance information on workflows, such as task execution times and data transfer times. These factors have motivated us to design scheduling algorithms that effectively deal with performance uncertainties using rescheduling and task rearrangement techniques.

This paper significantly extends our previous work [18] with the consideration of performance uncertainty. Performance uncertainty is especially damaging in workflow scheduling, since a delay in finishing one task can propagate widely, causing delays in many jobs and on many resources. For example, if a task in a job finishes later than expected, the delay may propagate not only to its successor tasks, but also to tasks of other jobs assigned to the same resources as the task and its successors. If nothing is done to reconsider the schedule, the overall execution is greatly degraded.

We propose a novel reservation adjustment approach that applies our rescheduling algorithm [18] to deal with delays arising from inaccuracy in performance information. Our approach explicitly takes into account the temporal constraints of (scheduled) workflow tasks and determines whether influenced reservations can be rearranged with tasks of other workflow applications. This rearrangement allows all tasks to run on an appropriate resource instance, thus limiting the delays to scheduled applications.

Extensive experiments have been conducted to study the performance of our scheduling approach under different degrees of prediction accuracy. The results show that our reservation rearrangement algorithm effectively reduces the propagation of postponement among reservations, thereby improving scheduling robustness.

The rest of this paper is organized as follows. Section 2 describes the problem of multiple-workflow scheduling. In Sect. 3, we present our rescheduling with a graph search algorithm. Section 4 shows how our rescheduling solution is extended to deal with inaccurate performance prediction. Performance evaluation results are presented in Sect. 5 followed by related work in Sect. 6. We then draw our conclusion in Sect. 7.

2 Problem description

In this section, we describe the problem of multiple-workflow scheduling. We first briefly describe the system model and the application model, and then illustrate two specific issues in our problem: resource efficiency and scheduling robustness.

2.1 System model

The target system in this study is viewed as a set of heterogeneous computing/resource sites, each of which consists of a set of resources. Resource sites are loosely coupled, for example through a wide area network with various network bandwidths (as shown in Fig. 1). More formally,

$$\begin{aligned} S = \{R_1, R_2, \ldots, R_n\} \end{aligned}$$

where \(R_{i}\) is an autonomous resource site that can take computing tasks independently. Resources vary in computing capacity (the number of processors, the size of memory and hard disk), computing power (CPU frequency and internal network bandwidth) and other resource properties (such as hardware platform, operating system or services provided). A resource can run several tasks concurrently if the availability and computing capacity allow.

Fig. 1 A large-scale distributed system

2.2 Workflow application

A workflow application/job consists of precedence-constrained tasks with data dependencies. The execution of these tasks is coordinated in such a way that precedence constraints are respected. A workflow job can be represented by a directed acyclic graph (DAG), \(G = (V, E)\), where \(V\) is a set of tasks and \(E\) is a set of directed edges representing precedence constraints between corresponding tasks. Figure 2 gives two simple examples of DAGs.

Fig. 2 Two simple DAGs. a Job A with five tasks. b Job B with four tasks

For a given edge from task \(v\) to task \(w\), \(v\) is defined as \(w\)’s predecessor, and \(w\) is defined as \(v\)’s successor. A task without any predecessors is an entry task, and a task without any successors is an exit task. A task is regarded as ready to run (or simply as a ‘ready task’) when all its predecessor tasks have completed and transmitted the information it needs. Thus, the readiness of each task is determined by its predecessors, more specifically by the one that completes its communication latest. More formally, the earliest start time (EST) of a task is defined as:

$$\begin{aligned} \mathrm{EST}(t,r) = \max \{\mathrm{FT}(t_i) + \mathrm{data}(t_i, t)/\mathrm{bandwidth}(r_i, r) : t_i \in \mathrm{predecessor}(t),\ r_i = \mathrm{resource}(t_i)\}, \end{aligned}$$

where FT is the function giving the expected finish time for each task, and communication (data transmission) time is calculated from the data size and network bandwidth between corresponding resources.

A task with multiple parent tasks requires synchronization. This synchronization means that each task also has a latest finish time (LFT), calculated from the need to provide information to each successor in time for that successor to run:

$$\begin{aligned} \mathrm{LFT}(t,r) = \min \{\mathrm{ST}(t_j) - \mathrm{data}(t, t_j)/\mathrm{bandwidth}(r_j, r) : t_j \in \mathrm{successor}(t),\ r_j = \mathrm{resource}(t_j)\} \end{aligned}$$

where ST is the expected start time. We assume that the structure of a given workflow is known at the time of job submission, and the execution time of a task can be estimated using performance modeling/prediction techniques, e.g., [20, 21].
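
To make these definitions concrete, the following is a minimal Python sketch of the EST and LFT computations; the task/resource representation and the helper maps (finish_time, start_time, data_size, bandwidth, assigned_resource, predecessors, successors) are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the EST/LFT computations defined above. All names are
# illustrative placeholders, not from the paper's implementation.

def earliest_start_time(task, resource, finish_time, assigned_resource,
                        data_size, bandwidth, predecessors):
    """EST(t, r): latest arrival time of data from any predecessor."""
    return max(
        (finish_time[p]
         + data_size[(p, task)] / bandwidth[(assigned_resource[p], resource)]
         for p in predecessors[task]),
        default=0.0,  # an entry task can start immediately
    )

def latest_finish_time(task, resource, start_time, assigned_resource,
                       data_size, bandwidth, successors):
    """LFT(t, r): latest finish that still feeds every successor in time."""
    return min(
        (start_time[s]
         - data_size[(task, s)] / bandwidth[(assigned_resource[s], resource)]
         for s in successors[task]),
        default=float("inf"),  # an exit task has no downstream constraint
    )
```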

2.3 Resource efficiency when scheduling multiple workflows

When a new job arrives, the scheduler decides where (on which resource) and when (at which time) each of its tasks is to be executed; that is, an advance resource reservation (or advance reservation for short) is made for each task so that the resource will have capacity available in the appropriate period. Note that when one workflow job is scheduled, there are often tasks of previously submitted jobs still waiting for their predecessors to finish. This paper studies cases where such previous reservations should be changed as the execution proceeds.

Now, consider two workflow jobs A and B, as shown in Fig. 2, submitted successively to a scheduler for execution over resources R1, R2 and R3. Suppose that, from consideration of the critical path [7, 8] and estimates of the duration of each task and of inter-task communication, the scheduler allocates tasks as shown in Fig. 3a. The system makes advance reservations for these tasks. Then, job B arrives. Because of the resources already reserved for job A, task B3 has to be scheduled after task A2. The best scheduling that can be done without changing the reservations for job A is shown by the solid rectangles in Fig. 3b.

Fig. 3 Task rescheduling to optimize resource allocations. a The initial scheduling of job A. b Scheduling job B with regard to job A. c Optimized scheduling after rescheduling task A2

However, if task A2 could be rescheduled to a later time slot on another resource, such as the dashed rectangle in Fig. 3b, then task B3 could start earlier and job B could finish earlier. An optimized resource allocation is shown in Fig. 3c. Clearly, although the first schedule of task A2 makes it finish at the earliest time, it becomes a bad choice when job B arrives. A rearrangement of scheduled tasks can optimize the resource allocation for both jobs.

2.4 Broken reservations and delay propagation

An advance reservation is made based on predictions of task execution time and data transmission time. However, such a reservation may not always be kept, as the actual task execution and/or data transfer may differ from the estimate. We refer to a reservation that cannot be kept as a broken reservation. A broken reservation may propagate to other reservations through precedence constraints: if we simply prolong a broken reservation until the resource becomes available, other reservations may in turn need changing. Thus, broken reservations must be adjusted if we are to minimize performance degradation such as an increase in makespan. A more effective way to deal with broken reservations is reservation rearrangement that takes the other tasks in the system into account. Figure 4 illustrates our rearrangement scenario.

Fig. 4 Task rescheduling to avoid unnecessary postponement

Suppose there are tasks scheduled on distributed resource sites with the advance reservations shown in Fig. 4a. Among them, tasks X1, X2 and X3 belong to one workflow job and must be executed sequentially. At time \(t_{1}\), a delay in the execution of X1 is detected. This delay is expected to influence its successor task X2, and may further pass down to more tasks such as Z1 and X3. However, these delays can be avoided if task X2 is rescheduled to resource R3, as shown in Fig. 4b.

3 Rescheduling algorithm

For both issues described above, resource efficiency for multiple workflows and scheduling robustness under broken reservations, we adopt an adaptive rescheduling mechanism to improve performance or reduce the propagation of delays. In this section, we present our rescheduling algorithm; in the next section, we extend it to handle broken reservations.

3.1 Task-resource matching

If we take tasks and resources as two disjoint sets, then a scheduling problem becomes a bipartite matching problem.

For example, in Fig. 5a, nodes in part T represent tasks (each belonging to a particular workflow) and nodes in part R represent resources. For a given task, a resource is defined to be satisfiable if it meets the task’s resource requirements, including computing and storage capacities. Every task is linked with all its satisfiable resources in the bipartite graph. So, T1 has one satisfiable resource, R1, and T2 has two, R1 and R2. An arrow pointing to the task represents the case where the resource is actually allocated to that task (that is, the appropriate amount of CPUs and memory is reserved for the task for a specific period of time); otherwise, the arrow points away from the task. The figure shows that T1 and T2 are scheduled on R1 and R2, respectively, while T3 is not scheduled at the moment.

Fig. 5 An example of resource allocation rearrangement

To schedule T3, we need to check all its satisfiable resources (R2 in our example) for an appropriate time slot during which available resources can be reserved to meet T3’s requirements. However, if R2 is already too committed for the time period involved, then we cannot schedule T3 without further changes. Thus, the workflow to which T3 belongs would have to be either rejected or have its deadline relaxed. Instead, we propose to rearrange existing reservations so that we can place T3 on R2. We scrutinize the reservations on R2 to see whether any task can be rescheduled to satisfy T3’s temporal constraint. In Fig. 5a, we assume that T2 can be removed from R2 for this purpose. Then, the problem becomes whether T2 can be rescheduled successfully on another resource instance or time slot without violating its temporal constraint. In our simple case, we assume that R1 has enough resources for both T1 and T2, so no more tasks need be removed. This path of rescheduling tasks is shown as the dashed lines in Fig. 5b. By reversing the direction of these dashed lines, we get a new matching in Fig. 5c, where T3 is scheduled on R2 and T2 shares resource R1 with T1.
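
The rearrangement just described amounts to reversing the matched/unmatched status of the edges along the path of dashed lines. A minimal sketch follows (task and resource names match Fig. 5; the dictionary representation is an illustrative assumption):

```python
# Flipping a matching along the augmenting path of Fig. 5b. `matching`
# maps each scheduled task to its resource (several tasks may share one
# resource, since the matching is many-to-one).

def apply_augmenting_path(matching, path):
    """path lists (task, resource) pairs; each task moves to its new resource."""
    for task, resource in path:
        matching[task] = resource
    return matching

matching = {"T1": "R1", "T2": "R2"}                        # Fig. 5a
apply_augmenting_path(matching, [("T3", "R2"), ("T2", "R1")])
print(matching)  # {'T1': 'R1', 'T2': 'R1', 'T3': 'R2'}    # Fig. 5c
```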

3.2 Dynamic search tree

A traditional bipartite graph matching algorithm [22] can find the maximum one-to-one matching in a bipartite graph. However, our problem is more complicated. First, task scheduling involves many-to-one matching; for example, T2 can share resource R1 with T1. Second, we cannot simply determine in advance whether a task can be matched with (or scheduled on) a resource: temporal constraints and resource availability must be considered, and these depend on the scheduling of other tasks. Any task rescheduling leaves different resource availability to other tasks and different temporal constraints to its predecessors and successors.

We develop a new match searching algorithm based on a dynamic search tree. Our algorithm adopts the basic idea of the standard bipartite matching algorithm: rearranging the matches along an augmenting path, which alternates between edges that are in the matching and edges that are not (the dashed lines in Fig. 5b). In searching for the augmenting path, each step along an edge means that a task is assigned to a resource by removing another task from it. The removed task then triggers the next step of replacement. In principle, we could remove several tasks from a resource to make enough room for others; to simplify the computation, we restrict each rearrangement step that we consider to replacing a single task. We define a tuple:

$$\begin{aligned} \langle t1, r, t2 \rangle \end{aligned}$$

to represent that a task \(t1\) can be scheduled on one of its satisfiable resources \(r\) by replacing another task \(t2\) that is currently scheduled on \(r\). We call \(t1\) the removed task (it was removed from the schedule in the last replacement), \(t2\) the replaceable task (it will be the removed task in the next step), and \(r\) the satisfiable resource. For any removed task, there may be many possibilities for such a replacement, through different satisfiable resources and alternative tasks on a resource. The algorithm goes through each of them until some task can be rescheduled without replacing any other task.

Our search begins from the task that needs to be scheduled. It can be regarded as a search tree (as shown in Fig. 6) that grows down from removed tasks to their children: replaceable tasks. The replaceable tasks then become removed tasks and the search tree extends to one lower level of replaceable tasks. A search path along the tree represents a sequence of rearrangements in matching. For example, the bold lines in the figure represent that T0 can be scheduled by replacing T1 on the resource where T1 is currently matched, and T1 can be rescheduled by replacing T2 on the resource where T2 is currently matched. If T2 is rescheduled without replacing any other task, this path will be an augmenting path that we are searching for.

Fig. 6 A search tree for an augmenting path (a series of rearrangements)

The search tree is built up dynamically during the search process. When the search reaches a task node along a path, we calculate the current status of resource availability (after the sequence of rearrangements among the concerned resources on the path) and the task’s time slot boundary according to the current scheduling of its predecessors and successors. Then, we determine whether the task can be rescheduled on an available resource; if not, we determine which tasks can be replaced in the next step to make room. The search tree grows level by level as the search goes deeper. When it reaches the deepest task along a branch, the search backtracks to the upper level. At that point, the rearrangements along the retreated path must also be rolled back; the search then continues once a new level of task nodes has been computed along another branch.

3.3 Search algorithm

Our rescheduling algorithm (Fig. 7) adopts a depth-first search to build the search tree and look for an augmenting path. In each step of the search, the algorithm first checks whether it is going deeper along a path or backtracking to an upper level (line 7). If it is backtracking, the rearrangements along the retreated path are rolled back (lines 9 and 10) to restore the former scheduling status. The search path moves forward by task replacement in line 12, and the new scheduling is calculated according to the current status of task-resource matching.

Fig. 7 Task rescheduling algorithm

In our algorithm, we restrict the reschedulings we offer for a task to those within its time slot boundary (between EST and LFT), so that we avoid rescheduling the entire workflow (more precisely, the predecessors and successors of the task). For example, the rescheduling of task A2 in Fig. 3 does not require task A1 or A5 to also be rescheduled to preserve the temporal constraints.

We adopt a heuristic to trim the search tree so that the time complexity is affordable: we require that each task appear at most once in the search tree. Thus, in the worst case, our search goes through \(n\) nodes, where \(n\) is the number of tasks scheduled in the system. In each step of the search, all the satisfiable resources of a task are checked for rescheduling; hence, the time complexity is \(O(sn)\), where \(s\) is the average number of satisfiable resources per task.
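
Putting Sects. 3.2 and 3.3 together, the following is a hedged sketch of the depth-first search of Fig. 7; the scheduler helpers (satisfiable_resources, can_place, candidate_replacements, place, unplace) are illustrative placeholders rather than the actual implementation.

```python
# Depth-first search for an augmenting path, with rollback on
# backtracking mirroring the restore step of Fig. 7.

def find_augmenting_path(task, scheduler, visited=None):
    """Try to (re)schedule `task`; True iff an augmenting path is found."""
    if visited is None:
        visited = set()
    visited.add(task)  # each task appears at most once in the search tree
    for resource in scheduler.satisfiable_resources(task):
        # A free slot within [EST, LFT]: the augmenting path ends here.
        if scheduler.can_place(task, resource):
            scheduler.place(task, resource)
            return True
        # Otherwise make room via a replacement tuple <task, resource, t2>.
        for t2 in scheduler.candidate_replacements(task, resource):
            if t2 in visited:
                continue
            scheduler.unplace(t2)
            scheduler.place(task, resource)
            if find_augmenting_path(t2, scheduler, visited):
                return True
            # Backtrack: roll back the rearrangement along this branch.
            scheduler.unplace(task)
            scheduler.place(t2, resource)
    return False
```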

4 Saving broken reservations

Advance reservation is an important resource provisioning mechanism widely adopted in many scheduling algorithms [13–17]. Our algorithm makes resource reservations for scheduled tasks that may belong to multiple workflows, and it improves resource efficiency by rearranging reservations (in other words, rescheduling) when new tasks arrive. However, advance reservations are based on predictions of task execution time and data transmission time, which are often inaccurate. Broken reservations are therefore rather common and should be explicitly dealt with to keep scheduling consistent in the system.

4.1 Reservation compatibility in rescheduling

Inaccurate performance prediction, specifically underestimated running time, produces delays in task execution. These unexpected delays may consume resources that have been reserved for other tasks and, therefore, break those reservations. One possible approach is to postpone the start time of all broken reservations (including those broken by other broken reservations) by a fixed time, for example, a one-hour postponement for a one-hour delay. Instead, we advocate rescheduling the task, as described in Sect. 2.4, to stop postponement from propagating to more and more reservations.

There is one issue that must be considered when rescheduling to save broken reservations: the rescheduled task must not induce any extra postponement to other reservations, because that may offset the benefit we obtain from rescheduling. We illustrate the issue with an example in Fig. 8.

Fig. 8 An example of task assignment in a resource site. a Resource reservations for tasks T1 and T2. b Reservation for task T2 is postponed. c Resource allocation if task T3 is rescheduled on the resource

Consider a resource instance which is reserved for two tasks as shown in Fig. 8a. The reservations are compatible, as the total required computation capacity, such as the size of memory and the number of processors, fits within the overall capacity of the resource instance. Now suppose that a delay in execution breaks a list of reservations, among them the reservation for task T2. To save these reservations, we try to reschedule the corresponding tasks. However, suppose that the rescheduling of T2 fails and its reservation has to be postponed for a certain time, as shown in Fig. 8b. Then, we continue to reschedule other tasks in the list, and our algorithm may find that task T3 can be rescheduled on the resource as shown in Fig. 8c. Although the reservations still fit within the capacity, a problem would arise if the reservation for task T1 were postponed after this rescheduling: that would force T2’s reservation to be postponed again, because the rescheduled task T3 occupies the capacity that originally allowed tasks T1 and T2 to run concurrently. Since delays due to inaccurate performance prediction are quite likely, we must avoid repeated postponement of a single reservation, as it compromises the benefit we obtain from rescheduling.

To avoid such situations, we revise our algorithm to check whether a rescheduled task is compatible with all others on the resource instance. We first hypothetically postpone all un-postponed reservations in the concerned time period and calculate whether the resource allocation is still compatible. In the example of Fig. 8, we first hypothetically postpone task T1 by one unit of time; we then find that the resource allocation would exceed the capacity if task T3 were rescheduled on it. So, rescheduling T3 as illustrated would not be compatible, and we must search for another resource instance or time slot for task T3. If the resource allocation is compatible when all un-postponed reservations are hypothetically postponed, it must be compatible when arbitrary reservations are actually postponed after the rescheduling.
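
A minimal sketch of this stricter compatibility test follows, assuming reservations are represented by start/end times and a capacity demand; the field names and the fixed postponement unit are illustrative assumptions.

```python
# Hypothetically postpone every not-yet-postponed reservation in the
# affected window, then verify peak demand never exceeds capacity.

def compatible_with_postponements(capacity, reservations, new_task,
                                  postpone_unit):
    """True if `new_task` still fits when pending delays materialize."""
    trial = []
    for res in reservations:
        start, end = res.start, res.end
        if not res.already_postponed:          # hypothetical postponement
            start, end = start + postpone_unit, end + postpone_unit
        trial.append((start, end, res.demand))
    trial.append((new_task.start, new_task.end, new_task.demand))
    # Demand is piecewise constant, so checking interval starts suffices.
    for t in {s for s, _, _ in trial}:
        load = sum(d for s, e, d in trial if s <= t < e)
        if load > capacity:
            return False
    return True
```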

4.2 Algorithm description

Our rescheduling algorithm is extended to deal with broken reservations. We adopt periodic polling to monitor the actual progress of task execution and data transmission. While earlier-than-expected completion of task execution and data transmission can easily be handled with existing techniques, such as back-filling algorithms [23, 24], our focus is the opposite case, in which completion is delayed and, in particular, other reservations are affected.

At a polling point, if a delay in task execution is detected, we need to extend the length of the reservation; if a delay in data transmission is detected, we have to postpone the start time of the task. We set the time length (in extending or postponing a reservation) equal to the interval between two polling points. If the delay is not over within that time, it will be detected and adjusted again at the next polling point. A higher polling rate would make these adjustments of task finish/start times more accurate, but at the expense of higher rescheduling overhead.

After detecting all delays in task start and finish times, we calculate which reservations are influenced by these delays. The influenced reservations are collected into two lists: one (PList) for those that can only be postponed, such as those that have already received data at the resource site (we do not consider re-transferring data in rescheduling), and the other (RList) for those that could be rescheduled to a different resource. We use these two lists in our algorithm, shown in Fig. 9.

Fig. 9 Reservation adjustment

Reservation adjustment begins from the first list (PList). The new allocation of each postponed reservation may further influence others, on the same resource site or among the postponed task’s successors. These influenced reservations are appended to the lists based on whether they can be rescheduled or not. When all the reservations that have to be postponed have been adjusted, the algorithm begins to reschedule the reservations in the other list.

We use the algorithm described in Sect. 3, with stricter checking on reservation compatibility as discussed in Sect. 4.1. If rescheduling of a task fails, its original reservation will be postponed. The process continues until all reservations in both lists are adjusted appropriately.
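
The adjustment procedure of Fig. 9 can be sketched as follows; the scheduler helpers (postpone, influenced_by, reschedule) and the must_postpone flag are illustrative assumptions, with reschedule standing for the Sect. 3 search combined with the Sect. 4.1 compatibility check.

```python
# Sketch of the reservation-adjustment loop (Fig. 9).

def drain_postponements(plist, rlist, scheduler):
    """Postpone everything in PList; cascaded effects re-enter the lists."""
    while plist:
        res = plist.pop(0)
        scheduler.postpone(res)
        for r in scheduler.influenced_by(res):
            (plist if r.must_postpone else rlist).append(r)

def adjust_reservations(plist, rlist, scheduler):
    drain_postponements(plist, rlist, scheduler)
    while rlist:
        res = rlist.pop(0)
        if not scheduler.reschedule(res.task):
            # Rescheduling failed: fall back to postponing this reservation.
            scheduler.postpone(res)
            for r in scheduler.influenced_by(res):
                (plist if r.must_postpone else rlist).append(r)
        drain_postponements(plist, rlist, scheduler)
```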

5 Performance evaluation

We have conducted extensive simulation experiments to evaluate the efficacy of our algorithm. This section details experimental settings and scheduling scenarios, and presents results.

5.1 Experimental settings

In this section, we describe characteristics of computing resources and workflow jobs followed by the model we use for the inaccuracy of performance prediction.

5.1.1 Computing resources

We have simulated a virtual system consisting of 1,000 heterogeneous resource sites. Each resource site is configured with different settings to simulate the computational features relevant to task scheduling. In particular, we virtualize each resource site as a tuple (computing capacity, computing speed, network bandwidth); these elements are assigned normalized values (Table 1).

Table 1 Experimental setup

Computing capacity is a parameter that describes the size of a resource site. It is usually represented by the number of processors (CPUs or cores) and the size of memory. We assume that at any time a certain number of processors and memory blocks are exclusively allocated to a task. Therefore, computing capacity determines how many tasks (each with a specific requirement on the number of processors, the size of memory, etc.) can run concurrently on a resource site.

Computing speed represents the overall performance of a resource site. It is often reflected by CPU frequency, data bus bandwidth and the I/O throughput of memory and disk. This parameter is used to determine the execution time of a task on a given resource site.

Network bandwidth defines the performance of the external network that connects resource sites. Our target system simulates resource sites that are linked through the Internet and are therefore fully connected, although bandwidths vary among the links. In the simulations, each resource site is assigned a bandwidth value, and the bandwidth of a link is the minimum of the values of its two resource sites. Network bandwidth determines the transmission time when data are transferred from one resource site to another.

There are other resource properties that may be considered in scheduling. For example, tasks may require a specific operating system, software or hardware devices to be installed on the target resource. As defined in Sect. 3.1, these properties determine the satisfiable resources of a task. In our experiments, we configure the properties of resource sites and the task requirements such that the number of satisfiable resources per task follows a Poisson distribution, and on average each task has 50 % of the resource sites in the system as its satisfiable resources.

5.1.2 Workflow jobs

Each workflow (DAG) is produced with varying parameters including the total number of tasks \(\vert V\vert \), the communication-to-computation ratio CCR and the shape factor \(\alpha \) (which reflects the degree of parallelism of a job). The value ranges of these parameters in our experiments are summarized in Table 1.

We randomly select each parameter value from the appropriate range when producing workflow jobs. Therefore, the jobs vary in size and degree of parallelism; some workflows are compute intensive and others are communication intensive.

We measure the performance of our algorithm when workflow jobs arrive sequentially; however, a job may arrive before the previous ones have completed, so workflow jobs run concurrently in the system. The number of jobs arriving per unit of time follows a Poisson distribution. By varying the arrival interval (the mean of the distribution), we compare the performance of the scheduling algorithms under different levels of resource contention. We use an abstract time unit in calculating workflow makespan and job arrival interval. In practice, this unit could be an hour or a minute; the choice does not affect the experimental results.

Each task in a workflow job has a specific requirement on the number of processors and the size of memory. We normalize this as a number between one and eight to simulate various requirements on computing capacity. In practice, this requirement can be determined in advance from the internal structure of the task (parallel processes/threads, data buffer usage, etc.). We assign a numeric workload to each task that determines its execution time on a given resource site (workload divided by computing speed). Task execution times range from 10 to 800 time units in our experiments. A task may also require input data from its predecessors. We assume that the data size is known at scheduling time, and the transmission time is calculated from the network bandwidth between the two resource sites. There are also local jobs arriving randomly at each resource site. They are scheduled in the earliest time slots available on the resource site, but they cannot preempt scheduled workflow tasks. In our experiments, we set local jobs to take 30 % of resource capacity.

5.1.3 Inaccuracy in prediction

To model the inaccuracy of performance predictions, we introduce an inaccuracy ratio defined as (predicted \(-\) actual)/actual. We let this ratio follow a normal distribution (with \(\mu = 0\)), both for predicted task execution times and for predicted data transmission times. We run experiments with different standard deviations of the distribution to simulate different degrees of prediction inaccuracy; the standard deviations we use are 0.1, 0.15 and 0.2.

Given the heterogeneous characteristics of the system, the predictions of the same task on different resource sites would have independently chosen inaccuracy ratios. Based on all these potentially inaccurate predictions, we make an advance reservation for each scheduled task that forms part of a newly submitted job.
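
For illustration, sampling a prediction under this model might look as follows (a sketch; the function names are our own):

```python
# Inaccuracy model of Sect. 5.1.3: (predicted - actual)/actual is drawn
# from N(0, sigma), independently per task/resource pair, with sigma in
# {0.1, 0.15, 0.2} in the experiments.
import random

def predicted_time(actual, sigma):
    """Turn an actual duration into a prediction with modeled error."""
    ratio = random.gauss(0.0, sigma)   # (predicted - actual) / actual
    return actual * (1.0 + ratio)
```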

5.2 Scheduling scenarios

We consider two different scenarios of multiple-workflow scheduling in our evaluation study. The distinction is based on whether workflow jobs have deadlines. The main objective when scheduling workflows with deadlines is to accommodate more jobs (high throughput), whereas the objective when scheduling workflow jobs without deadlines is to minimize completion time (low makespan). Our rescheduling approach helps improve the scheduling performance for multiple workflow jobs in both scenarios.

5.2.1 Scenario 1: rescheduling for high throughput meeting deadlines

In this scheduling scenario, each workflow job comes with a deadline, which may be a requirement from users or an agreement between users and resource providers. In the experiments, we set a deadline for each workflow job according to its task execution times at the average computing speed. By adjusting workflow deadlines, we intentionally make some jobs more urgent than others.

We guarantee a workflow job’s deadline by scheduling each of its tasks within a task deadline. The task deadlines are calculated according to the workload proportions of tasks and edges. During scheduling, each task is assigned to a resource instance on which it can finish before its deadline. This assignment may require rearranging previously scheduled tasks to leave enough computing capacity for the newly arrived ones. If this deadline-based scheduling fails, our approach falls back to the traditional greedy algorithm that schedules each task to achieve the earliest finish time. The calculated earliest finish time (EFT) may lie beyond the deadline of the job; in that case, the job has to be rejected. We discuss the details of this scheduling approach in [18].
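
As one illustration of distributing a job deadline over tasks in proportion to workload (the exact scheme is detailed in [18]; this sketch ignores edge communication weights and assumes tasks are iterated in topological order):

```python
# Hypothetical proportional deadline distribution; not the paper's
# exact scheme, which additionally accounts for edge workloads [18].

def task_deadlines(tasks, predecessors, workload, job_deadline):
    """Deadline of t is proportional to the heaviest workload path ending at t."""
    path_load = {}
    for t in tasks:  # `tasks` must be in topological order
        path_load[t] = workload[t] + max(
            (path_load[p] for p in predecessors[t]), default=0.0)
    critical = max(path_load.values())
    return {t: job_deadline * path_load[t] / critical for t in tasks}
```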

5.2.2 Scenario 2: rescheduling for low makespan

In this scenario, we schedule each task at its EFT to minimize makespan. We employ our task rearrangement approach so that a new task can finish earlier without influencing the makespan of other scheduled jobs.

When a task is to be scheduled, our approach first calculates the EFT it could achieve under the current resource reservations (the current schedule). It then determines whether the task could finish even earlier than this initial EFT if some other task were removed from the schedule. We collect a list of such tasks: removing any one of them allows the task under scheduling to finish earlier than its ‘earliest finish time’. Our rescheduling algorithm then checks whether any of these removed tasks can be rescheduled successfully; for exit tasks, the rescheduling must not extend the makespan of the job. If rescheduling fails for all candidate tasks, the task under scheduling is scheduled at the EFT calculated in the first step.
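
This placement step can be sketched as follows; the scheduler helpers (earliest_finish, tasks_blocking_earlier_finish, reschedule, place_at) are illustrative placeholders:

```python
# Sketch of the Scenario-2 placement step (Sect. 5.2.2).

def schedule_for_low_makespan(task, scheduler):
    eft, slot = scheduler.earliest_finish(task)        # step 1: initial EFT
    # Step 2: tasks whose removal would let `task` beat its initial EFT.
    for blocker in scheduler.tasks_blocking_earlier_finish(task, eft):
        # For exit tasks, rescheduling must not extend the job's makespan.
        if scheduler.reschedule(blocker):              # Sect. 3 search
            eft, slot = scheduler.earliest_finish(task)
            break
    scheduler.place_at(task, slot)                     # fall back to step-1 EFT
```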

5.3 Experimental results

Our experiments start with an evaluation of four existing workflow scheduling algorithms: HEFT [5], DCP-G [7], DAGMap [8] and the meta-heuristic GRASP [9]. In particular, we compared the latter three against HEFT; we chose HEFT as the reference algorithm due to its simplicity and performance. We apply advance reservation to these algorithms for scheduling multiple workflows. For each algorithm, the average makespan is measured when a certain number of jobs are concurrently scheduled in the system. Our experiment covers situations ranging from high resource availability to intense resource contention.

We calculate the makespan speedup ratio of each algorithm (DCP-G, DAGMap and GRASP) relative to HEFT. Figure 10 shows that although these algorithms can shorten the makespan of workflows compared with HEFT, the performance advantage degrades quickly in multiple-workflow scheduling; their performance is quite similar to that of HEFT under high resource contention among workflows. We therefore compare the performance of our solution (DGR) against that of HEFT in the different scenarios below.

Fig. 10 Performance under multiple-workflow scheduling

5.3.1 Performance in guaranteeing workflow deadline

In this series of experiments, each workflow job comes with a deadline, and the scheduler needs to ensure that accepted jobs finish before their deadlines. For the purpose of the experiment, we make some jobs more urgent than others by introducing an urgency parameter when producing workflow jobs. It is the ratio (deadline \(-\) makespan)/makespan, where the predicted makespan is calculated from the average computation/communication speed of the target system. The value of this parameter is drawn from a standard normal distribution across all submitted workflow jobs.

As each submitted workflow job is constrained by its deadline while resources are shared and their capacity is limited, the admission of a job is determined by its “schedulability”. In an ideal situation, the predictions of task execution time and data transmission time are accurate, and all accepted workflow jobs finish before their deadlines as expected. In practice, however, inaccuracy in predictions breaks this premise. One way to deal with this QoS degradation is overestimation (pessimistic estimation with some padded value). Here, the predicted time is more likely to be longer than the actual one, so the probability of delay is reduced. This method has a side effect on scheduling performance: it reduces acceptance rate and resource utilization, because a larger time slot is reserved for each task.

In the experiment, we apply overestimation according to the degree of prediction inaccuracy (standard deviation 0.1, 0.15 or 0.2) to keep the deadline satisfiability (the percentage of accepted jobs that actually finish before their deadlines) above 95 %. Under this condition (the same deadline satisfiability), the performance of our algorithm DGR is compared against HEFT in terms of acceptance rate and resource utilization. We also measure the running time of the different algorithms for scheduling and rescheduling (handling broken reservations).

As can be seen in Fig. 11, DGR is more robust than HEFT in terms of both acceptance rate and resource utilization; in particular, DGR obtains a 10–20 % higher acceptance rate and resource utilization than HEFT. Considering typical workflow job running times, an increase of milliseconds in scheduling and rescheduling time is negligible. Less time is taken when prediction inaccuracy is high, because more overestimation is required in that case to obtain the same deadline satisfiability, and the overestimation relaxes resource contention in scheduling.

Fig. 11 Performance comparison between DGR and HEFT under different degrees of prediction inaccuracy

5.3.2 Performance in minimizing workflow makespan

In this set of experiments, all submitted jobs are accepted and scheduled for minimal makespan. We measure the average makespan of workflow jobs scheduled by our algorithm and by HEFT as resource contention scales up.

Figure 12a shows the average makespan of jobs scheduled by our algorithm under different prediction inaccuracy levels. Although inaccuracy in predictions increases makespan, its impact on our algorithm is smaller than on HEFT: our algorithm delivers a higher degree of speedup as prediction inaccuracy increases (as shown in Fig. 12b). Figure 12c gives the average running time of our algorithm. As no overestimation is applied in this scenario, much time is spent on saving broken reservations, especially when prediction inaccuracy is high. This scheduling overhead is easily justified by the performance benefits (makespan) and the fact that workflow jobs may run for hours or days.

Fig. 12 Performance of our algorithm in minimizing workflow makespan

6 Related work

Although there have been extensive studies on workflow scheduling, only a small number of them explicitly deal with scheduling of multiple, concurrent workflows (e.g., [13, 25–28]). Zhao and Sakellariou [25] convert the problem of multi-workflow scheduling into a single-workflow scheduling problem by merging several workflows. However, this approach is static and therefore becomes less effective when workflows arrive constantly. Yu and Shi [26] proposed an approach that dynamically prioritizes all ready tasks of different workflows. However, the work in [26] schedules workflows in a best-effort manner without considering QoS requirements, such as workflow deadlines; and as the approach prioritizes small jobs, it is subject to starvation if many small jobs arrive constantly. Decker and Schneider [13] apply HEFT to multiple-workflow scheduling with advance reservations; this is the reference algorithm in this paper. Mao and Humphrey [27] deal with multiple workflows constantly arriving in various patterns: stable, cyclic/bursting, growing and on-and-off. Their work uses a load vector to consolidate tasks so as to minimize the number of resources (instances in Amazon EC2); instances are acquired and released dynamically to minimize cost without breaching application deadlines. Malawski et al. [28] study the efficiency of executing multiple scientific workflows (termed “workflow ensembles”) on public clouds under cost and deadline constraints. The authors developed a set of algorithms to improve the cost-performance ratio and evaluated their effectiveness over a broad range of budget and deadline constraints. These algorithms, however, are not explicitly designed to deal with inaccuracy in performance prediction.

The prediction accuracy of task execution time and data transmission time is vital for scheduling performance and QoS support. Extensive efforts have been made on this matter, including [29, 30]. Ali et al. [29] mathematically described a general procedure that first identifies performance features and perturbation parameters, and then analyzes the impact of the perturbation parameters on the performance features. In a later study [30], they further developed a mathematical model to compute whether a resource allocation can probabilistically guarantee a given level of QoS. Despite these results, accurate prediction is still very rare in practice.

Shi et al. [31] investigated a bi-objective scheduling problem aiming to minimize makespan and maximize robustness for workflow schedules. Their study is based on the intuition that the expected makespan of a workflow job is stable if tasks are scheduled with some slack as a cushion against prediction errors. As achieving schedule robustness through slack and minimizing makespan are conflicting objectives, they developed a genetic algorithm with which users can search for a better balance between the two performance metrics.

There are also studies that consider rescheduling in response to the uncertainty of Grid environments. The work in [11, 32] adopts a policy that recalculates the schedules of all unexecuted tasks whenever an execution delay would affect the makespan of the whole job. The studies in [33, 34] proposed a rescheduling strategy for workflow jobs that is triggered when a new resource is discovered or resource performance changes. Unlike our DGR algorithm, these approaches reschedule all unexecuted tasks whenever a change occurs in the system; this is less appealing for multiple-workflow scheduling, especially when resource contention is high.

7 Conclusion

In this paper, we have addressed multiple-workflow scheduling with an efficient rescheduling heuristic supported by task rearrangement. The scheduling of workflow applications is complicated by the precedence constraints of tasks and the dynamic and heterogeneous nature of distributed resources across multiple sites. Resource contention among concurrent workflows makes this scheduling problem even more difficult, hindering the effective utilization of resources. Our rescheduling mechanism exploits the scheduling flexibility of workflow applications to optimize resource allocation. Our study has explicitly considered the uncertainty in resource performance prediction, which significantly impacts scheduling robustness. Evaluation showed that scheduling performance is greatly improved by our rescheduling approach, even under a range of degrees of prediction inaccuracy.