1 Introduction

Modern computing platforms increasingly use specialized hardware accelerators, such as GPUs or Xeon Phis: 102 of the supercomputers in the TOP500 list include such accelerators, and some of them include several accelerator types [24]. The increasing complexity of such computing platforms makes it hard to predict the exact execution time of computational tasks or of data movement. Thus, dynamic runtime schedulers are often preferred to static ones, as they are able to adapt to variable running times and to cope with inaccurate predictions. Indeed, with the widespread heterogeneity of computing platforms, many scientific applications now rely on runtime schedulers such as OmpSs [22], XKaapi [7], or StarPU [4]. Most of these frameworks model an application as a Directed Acyclic Graph (DAG) of tasks, where nodes represent tasks and edges represent dependences between tasks. While task graphs have been widely studied in the theoretical scheduling literature [14], most existing studies concentrate on static scheduling in the offline context, where both the graph and the running times of the tasks are known beforehand.

We believe that there is a crucial need for online schedulers, that is, for scheduling algorithms that rely neither on the structure of the graph nor on the knowledge of tasks’ running times. First, not all graphs are fully available at the beginning of the computation: sometimes the graph itself depends on the data being processed, so that different inputs may result in different task graphs. This is especially the case when the behavior of an iterative application depends on the accuracy of the output. Second, in most existing runtimes, even if the graph does not depend on the input data, it is not fully submitted at the beginning of the computation; instead, tasks are dynamically uncovered during the computation. Third, even if part of the graph is available, schedulers (such as StarPU [4]) usually avoid traversing large parts of the graph each time they take a decision, in order to strongly limit the time needed to take decisions. Finally, tasks’ processing times are not always known beforehand, and the occasionally available predictions may not be very accurate, as two successive executions of the same task may result in slightly different timings.

There has recently been an effort by the scheduling community to bridge the gap between the assumptions used in theoretical studies and the information actually available to the schedulers underlying runtime systems (see details in Sect. 2). Schedulers for independent tasks on hybrid platforms were proposed first [5, 8, 11]. Some of them have since been adapted to task graphs: [20] extends the algorithm of [5] to the (offline) scheduling of task graphs, while [2] adapts an online scheduler for independent tasks on hybrid platforms [17] to obtain a competitive online scheduler for task graphs.

In the present paper, we concentrate on the online scheduling of task graphs on a hybrid platform composed of two types of processors, which we call CPU and GPU for convenience. There are m CPUs and k GPUs, where \(m\ge k \ge 1\). Note that we do not make any assumption on the CPUs and GPUs (i.e., on the processing times of each task), so these results may be applied symmetrically to the converse case with more GPUs. The objective is to schedule a DAG G of tasks so as to minimize the total completion time, or makespan. Each task can be assigned either to a single CPU or to a single GPU. We adopt the notations of [2]: the processing time of task \(T_i\) on a CPU is denoted by \(\overline{p_i}\) and on a GPU by \(\underline{p_i}\).

We consider the following online problem. At the beginning, the algorithm is aware of all the input tasks of the graph, and can schedule each one on either a CPU or on a GPU. A task is released and becomes available to the scheduler only when all its predecessors are terminated. At any given point in the computation, the scheduler is totally unaware of tasks that have not yet been released, but it knows the processing times \(\overline{p_i}\) and \(\underline{p_i}\) of all available tasks: we assume that tasks correspond to well-known kernels whose processing times have been acquired through extensive benchmarking; this happens in particular in linear algebra applications. We do not take into account the time needed for moving data and assume that there is no delay between the release of a task and the start of its processing.

The closest related work, which considers the very same problem, is [2]; it provides a \(4\sqrt{m/k}\)-competitive algorithm. We recall that an online algorithm is x-competitive if the makespan returned by this algorithm on any instance is at most x times the optimal makespan (which can be computed by an offline algorithm). The present paper brings the following contributions:

  • We prove that the competitive ratio of any online algorithm is lower-bounded by \(\sqrt{m/k}\). We study how the knowledge of the task graph and the flexibility of the tasks may influence this lower bound; in particular, we prove that knowing the bottom-level of every task (i.e., the length of the critical path from this task to the end of the graph) or having preemptive tasks does not help much, whereas the knowledge of the number of descendants reduces the lower bound to \(\frac{1}{2}({m/k})^{1/4}\) (Sect. 3).

  • We propose a \((2\sqrt{m/k}+1)\)-competitive algorithm, by refining both the algorithm and the analysis of [2] (Sect. 4.1).

  • We propose a simple heuristic (Sect. 4.2) based on the system-oriented EFT policy; it both has a guaranteed competitive ratio and performs well in practice, as we show with a comprehensive simulation set (Sect. 5).

2 Related Work

We briefly position our contributions in comparison to the existing work, starting with the offline case where the whole scheduling problem (both task dependences and running times) is known beforehand.

Offline Algorithms. Several schedulers for independent tasks on hybrid platforms have been proposed. Bleuse et al. [8] designed a polynomial-time but expensive \(\left( \frac{4}{3} +\frac{1}{3k}\right) \)-approximation. Low-complexity algorithms, which are closer to our work, have been studied in [5, 11]; they achieve approximation ratios of 2 and \(2+\sqrt{2}\), respectively. For tasks with precedence constraints, Kedad-Sidhoum et al. [18] provided a tight 6-approximation algorithm based on linear programming. In a different setting, Raravi et al. [21] consider the same kind of platform, composed of two types of processors, on which the objective is to schedule a set of chains of tasks, each task having a release date and a deadline. They design an algorithm that schedules all the tasks of a chain on the same processor, under some assumptions such as the existence of a valid schedule on slightly slower processors.

Online Algorithms. When tasks with precedences are released over time, Graham’s List Scheduling algorithm [16] is 2-competitive on homogeneous processors (note that this also matches the best offline approximation for this problem). In our model with two sets of processors, Imreh [17] and Chen et al. [13] proposed algorithms for scheduling independent tasks with competitive ratios smaller than 4. Building on this work, Amaris et al. [2] exhibited an online algorithm for tasks with precedence constraints, achieving a competitive ratio of \(4\sqrt{m/k}\).

Runtime Strategies. Actual runtime schedulers usually rely on low-complexity scheduling policies to limit the time needed to allocate tasks. For instance, StarPU [4] builds a performance model of tasks that enables it to predict their processing times. When a new task is submitted, it is allocated to the resource that will complete it the soonest (when using the dm policy, previously called heft-tm in [3]), which corresponds to the classical Earliest Finish Time (EFT) scheduling policy [19]. Other strategies have been proposed that take into account communication times, or precomputed task priorities that depend on the descendants of each task. We include similar information in the design of the lower bounds on competitive ratios (Sect. 3).
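To make the policy concrete, here is a minimal sketch of one EFT decision under the model of this paper (no data transfers); the function and parameter names are our own illustrative assumptions, not StarPU’s API.

```python
# Minimal sketch of the Earliest Finish Time (EFT) rule under the model of
# this paper (no data transfers). All names are illustrative assumptions,
# not StarPU's actual dm-policy interface.

def eft_assign(p_cpu, p_gpu, cpu_free, gpu_free, release_time):
    """Place one released task on the unit (CPU or GPU) completing it soonest.

    p_cpu, p_gpu: processing times of the task on a CPU / a GPU;
    cpu_free, gpu_free: lists of times at which each unit becomes idle.
    Returns the completion time of the task.
    """
    c = min(range(len(cpu_free)), key=lambda i: cpu_free[i])
    g = min(range(len(gpu_free)), key=lambda i: gpu_free[i])
    cpu_finish = max(cpu_free[c], release_time) + p_cpu
    gpu_finish = max(gpu_free[g], release_time) + p_gpu
    if cpu_finish <= gpu_finish:
        cpu_free[c] = cpu_finish
        return cpu_finish
    gpu_free[g] = gpu_finish
    return gpu_finish
```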

3 Lower Bound on Competitive Algorithms

In this section, we provide a lower bound on the competitive ratio of any online algorithm, as outlined in the following theorem. We also study how adding flexibility to task processing or giving some knowledge of the graph to the scheduler impacts this lower bound.

Theorem 1

No online algorithm has a competitive ratio smaller than \(\sqrt{m/k}\).

Proof

We prove this result here only when \(\tau =\sqrt{m/k}\) is an integer; the proof for the general case can be found in the corresponding research report [10]. Consider an online algorithm \(\mathcal A\). We fix an integer n, which will later be made arbitrarily large so that the competitive ratio approaches \(\tau \). We use an adversary argument: an adversary dynamically builds the graph depending on the current schedule produced by \(\mathcal A\). This results in a graph composed of nm tasks denoted by \(T^j_i\), with \(j=1,\ldots ,n\tau \) and \(i=1,\ldots ,k\tau \). Each task has a CPU processing time of \(\tau \) and a GPU processing time of 1.

The procedure consists of \(n\tau \) phases. During the jth phase, \(k\tau \) tasks are released (tasks \(T^j_i\) for \(i=1,\ldots , k\tau \)), without dependences between these tasks. The adversary selects the task that \(\mathcal A\) completes the latest, breaking ties arbitrarily. Let \(T^j_*\) be this task. The \(k\tau \) tasks of the next phase are then made successors of \(T^j_*\). See Fig. 1a for an illustration of the resulting graph.
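A sketch of this adversary, assuming \(\tau \) is an integer and a hypothetical callback schedule_phase that runs \(\mathcal A\) on one phase and returns the completion time of each task:

```python
import math

def adversary(schedule_phase, m, k, n):
    """Build the adversarial graph of Theorem 1 phase by phase.

    schedule_phase(tasks) stands for the online algorithm A: it receives the
    k*tau released tasks of a phase (CPU time tau, GPU time 1) and returns a
    dict mapping each task to its completion time. Hypothetical interface.
    """
    tau = math.isqrt(m // k)      # assumes m/k is a perfect square (see proof)
    edges, prev_star = [], None
    for j in range(1, n * tau + 1):
        tasks = [(j, i) for i in range(1, k * tau + 1)]   # tasks T^j_i
        if prev_star is not None:
            edges += [(prev_star, t) for t in tasks]      # successors of T^{j-1}_*
        completion = schedule_phase(tasks)
        prev_star = max(tasks, key=completion.get)        # latest task T^j_*
    return edges
```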

Fig. 1. Illustration of the graph and the buckets for \(\tau =2\), \(k=2\), \(n=3\).

We now show how to build an efficient (offline) schedule \(\mathcal S\) of the resulting graph. A bucket is defined by a set of processors, a starting time, and a duration. We use buckets to book some processors for an amount of time, and we schedule sets of tasks in given buckets. We consider \(n+1\) buckets, as illustrated in Fig. 1b. Each bucket \(B_i\), for \(i=1,\ldots ,n\), contains all m CPUs, has a duration of \(\tau \), and starts at time \(i\tau \). Note that m tasks fit into each of these buckets. The last bucket, B, contains one GPU, starts at time 0, and lasts for a time \(n\tau \). \(\mathcal S\) schedules the \(n\tau \) tasks \(T^j_*\) successively on a single GPU; they fit into bucket B. In parallel, \(\mathcal S\) schedules the remaining tasks on the CPUs. More precisely, it puts into bucket \(B_\ell \) the tasks \(T^j_i\) such that \((\ell -1)\tau < j \le \ell \tau \), except for the tasks \(T^j_*\). They all fit into the bucket, as there are fewer than \(\tau \times k\tau \le m\) such tasks. Moreover, task \(T^{\ell \tau }_*\) completes at time \(\ell \tau \). Therefore, every task \(T^j_i\) with \((\ell -1)\tau < j\le \ell \tau \) can be started at time \(\ell \tau \), and thus can be scheduled in bucket \(B_\ell \). Altogether, \(\mathcal S\) achieves a makespan equal to \((n+1)\tau \).

Now, we consider algorithm \(\mathcal A\), and we show that the makespan obtained is at least \(n\tau ^2\). At each phase, the adversary reveals the next phase only when all the tasks of the current phase are completed. If one task of the phase is scheduled on CPU, it takes a time \(\tau \). Otherwise, all \(k\tau \) tasks are scheduled on GPU, and the last one completes at time at least \(k\tau /k = \tau \). Therefore, \(\mathcal A\) completes each phase in time at least \(\tau \). As there are \(n\tau \) phases, the whole graph cannot be scheduled in time smaller than \(n\tau ^2\). The competitive ratio of \(\mathcal A\) is then at least:

$$ \frac{n\tau ^2}{(n+1)\tau } \quad \underset{n\rightarrow \infty }{\longrightarrow } \quad \tau .$$

   \(\square \)

The above proof suggests that the main difficulty of this problem lies in choosing on which type of resource (CPU or GPU) each task should be processed, rather than in building the actual schedule. This is indeed proven in the following lemma, which states that once an allocation of the tasks to the two types of resources is given, scheduling them on the \(m+k\) resources can be done with a constant competitive ratio (for the proof, please refer to [10]).

Lemma 1

If each task can be processed on a single type of resource, then any online list scheduling algorithm is \((3-\frac{1}{m})\)-competitive, and no online algorithm has a smaller competitive ratio.

The proof of Theorem 1 heavily relies on the fact that an online algorithm has no information on the successors of each task. In practice, it is sometimes possible to get some information on the task graph, for example by pre-computing some information offline before submitting the tasks. For instance, offline schedulers usually rank available tasks with priorities based on the dependences. On homogeneous platforms, the bottom-level of a task is commonly used, and is defined as the maximum length of a path from this task to an exit node, where nodes of the graphs are weighted with the processing time of the corresponding tasks. In the heterogeneous case, the priority scheme used in the standard HEFT algorithm [25] is to set the weight of each node as the average processing time of the corresponding task on all resources.
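For reference, bottom-levels can be computed by a backward traversal of the graph. A minimal sketch, assuming a hypothetical adjacency-list representation:

```python
from functools import lru_cache

def bottom_levels(succ, weight):
    """Bottom-level of each task: bl(v) = weight[v] + max over succ[v] of bl(s).

    succ[v]  : list of successors of task v (empty for exit tasks);
    weight[v]: node weight, e.g. the processing time on homogeneous platforms,
               or a HEFT-style average over all resources otherwise.
    """
    @lru_cache(maxsize=None)
    def bl(v):
        return weight[v] + max((bl(s) for s in succ[v]), default=0)

    return {v: bl(v) for v in succ}
```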

Knowing the bottom-level does not change the lower bound of Theorem 1: it is possible to transform the above proof using an adversary that submits tasks with identical bottom-levels in each phase (see details in the corresponding research report [10]). When there is exactly one GPU, the lower bound decreases to \(\frac{1}{2} \sqrt{m/k}\). An interesting feature of this proof is that all the tasks are equivalent (same CPU and GPU computing times), so other heterogeneous variants of the bottom-level lead to the same lower bounds.

When the online scheduler knows the number of descendants of each submitted task in addition to its bottom-level, the lower bound of Theorem 1 drops to \(\frac{1}{2}({{m/k}})^{1/4}\) when m/k is large enough, so no constant-factor competitive algorithm exists even then. All the tasks are again equivalent in this proof; the lower bound is thus also valid if the scheduler knows the CPU and GPU computing times of all the descendants and only the pattern of precedence relations remains unknown. Note, however, that no algorithm reaching this bound has been proposed.

Another interesting question is whether adding flexibility on how tasks are processed changes this bound. Allowing task spoliation (where tasks can be canceled and restarted on any resource, as done in [5]) does not help, and allowing task migration (where tasks can be preempted and resumed on any resource) only halves the bounds. Table 1 summarizes the lower bounds obtained for all combinations of knowledge given to the scheduler and flexibility in the task processing (for proofs, please refer to [10]).

Table 1. Lower bounds for various combinations of flexibility in task processing and knowledge given to the scheduler (BL stands for bottom-level).

4 Competitive Algorithms

4.1 The Quick Allocation (QA) Algorithm

Amaris et al. [2] designed an online algorithm named ER-LS, which is proved to be \(4\sqrt{m/k}\)-competitive. The results of Sect. 3 show that this ratio can only be improved by a constant factor, as no online algorithm can be better than \(\sqrt{m/k}\)-competitive. ER-LS applies the following processing to each available task \(T_i\):

  1. (a) If \(T_i\) can be completed on a GPU before time \(\overline{p_i}\), then assign it to GPUs.

     (b) Else, if \(\overline{p_i} / \underline{p_i} \le \sqrt{m/k}\), then assign \(T_i\) to CPUs; otherwise assign it to GPUs.

  2. Schedule \(T_i\) as soon as possible on the allocated type of resource.

The main objective of Step 1a is to avoid allocating the first tasks of the graph to a slow resource, which intuitively is only useful on small graphs. Such a technique enables a similar online algorithm to be constant-factor competitive for independent tasks (see [13]). However, it actually increases the competitive factor in the presence of precedence constraints. We propose to simplify the allocation phase by removing Step 1a. The resulting algorithm, QA (which stands for Quick Allocation), is then defined by Steps 1b and 2. Along with a refined analysis, this simplification allows us to reach a competitive ratio smaller than \(2\sqrt{m/k}+1\), which is almost tight, as outlined in the following theorems. The complete proofs of the following results are available in [10].
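A minimal sketch of one QA decision (Step 1b then Step 2), in the same illustrative style as the EFT sketch of Sect. 2:

```python
import math

def qa_assign(p_cpu, p_gpu, cpu_free, gpu_free, release_time, m, k):
    """Step 1b: allocate by comparing the acceleration ratio to sqrt(m/k);
    Step 2: start the task as early as possible on the chosen resource type.
    cpu_free / gpu_free hold the time at which each unit becomes idle."""
    if p_cpu / p_gpu <= math.sqrt(m / k):
        free, p = cpu_free, p_cpu     # not accelerated enough: CPU side
    else:
        free, p = gpu_free, p_gpu     # accelerated enough: GPU side
    i = min(range(len(free)), key=lambda j: free[j])
    free[i] = max(free[i], release_time) + p
    return free[i]                    # completion time of the task
```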

Theorem 2

QA is \(\left( 2\sqrt{m/k}+1-\left( mk\right) ^{-1/2}\right) \)-competitive.

Proof Sketch

Consider a graph G and the schedule \(\mathcal S\) obtained by QA, of makespan \(C_{ max }\). Let \(W_c\) (resp. \(W_g\)) be the sum of the processing times of the tasks scheduled on CPU (resp. GPU) by \(\mathcal S\), and \( CP \) be the computing time of a critical path of G, given the allocation of \(\mathcal S\). We first prove that:

$$C_{ max } \le \frac{W_c}{m}+\frac{W_g}{k}+\left( 1-\frac{1}{m}\right) CP .$$

Now, focusing first on the workload in the optimal solution, and then on the length of the critical path in the optimal solution, we can show the following inequalities and conclude:

$$\begin{aligned} \frac{W_c}{m} + \frac{W_g}{k} \le \left( 1+\sqrt{\frac{m}{k}}\right) \text {OPT} \quad \text { and } \quad CP \le \sqrt{\frac{m}{k}}\text {OPT}.\end{aligned}$$
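Combining these two inequalities with the first bound makes the final step explicit:

$$\begin{aligned} C_{ max } \le \left( 1+\sqrt{\frac{m}{k}}\right) \text {OPT} + \left( 1-\frac{1}{m}\right) \sqrt{\frac{m}{k}}\,\text {OPT} = \left( 2\sqrt{\frac{m}{k}}+1-\left( mk\right) ^{-1/2}\right) \text {OPT},\end{aligned}$$

since \(\frac{1}{m}\sqrt{m/k}=(mk)^{-1/2}\).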

   \(\square \)

Theorem 3

The competitive ratio of QA is at least \(\left( 2\sqrt{m/k}+1-\frac{1}{k}\right) \).

Proof Sketch

Let \(\varepsilon \) be a small processing time. Consider the graph composed of the three groups of tasks below; the online instance reveals the tasks in this order. The only dependence is from task \(\varepsilon \) to task d.

  • Group A. \(k(k-1)\) tasks with \(\overline{p_i}=\infty \) and \(\underline{p_i}=1/k\).

  • Group B. mk tasks with \(\overline{p_i}=(1+\varepsilon )/k\) and \(\underline{p_i}=1/\sqrt{mk}\).

  • Group C. Task \(\varepsilon \), with \(\overline{p_\varepsilon }=\infty \) and \(\underline{p_\varepsilon }=\varepsilon \), and task d, with \(\overline{p_d}=\sqrt{m/k}\) and \(\underline{p_d}=1+\varepsilon \).

As depicted in Fig. 2, QA schedules groups A and B and task \(\varepsilon \) on the GPUs, then task d on a CPU, for a total makespan equal to \(2\sqrt{m/k}+1-\frac{1}{k} +\varepsilon \). The optimal solution schedules only group B on the CPUs, for a total makespan equal to \(1+2\varepsilon \), hence the result.    \(\square \)
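To check the stated makespan of QA, note that each GPU receives \(k-1\) tasks of group A and m tasks of group B, so QA’s schedule (Fig. 2, left) decomposes as:

$$\begin{aligned} \underbrace{(k-1)\cdot \frac{1}{k}}_{\text {group A}} + \underbrace{m\cdot \frac{1}{\sqrt{mk}}}_{\text {group B}} + \underbrace{\varepsilon }_{\text {task } \varepsilon } + \underbrace{\sqrt{\frac{m}{k}}}_{\text {task } d \text { on CPU}} = 2\sqrt{\frac{m}{k}}+1-\frac{1}{k}+\varepsilon .\end{aligned}$$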

Fig. 2. Schedule obtained by QA (left) and the optimal one (right).

The proofs of these two results give some intuition on why a threshold equal to \(\sqrt{m/k}\) is the best choice in Step 1b. With a smaller threshold (closer to 1), more tasks would be allocated to the GPUs, which would allow tasks on the critical path to be processed faster. However, the GPUs, which can be seen as a scarce resource (since \(m\ge k\)), might then be wasted on tasks that are not accelerated enough. For instance, if the GPU computing times of the tasks of group B in the proof of Theorem 3 were larger, such an algorithm would perform worse than QA. Conversely, with a larger threshold (closer to m/k), the GPUs would not be wasted on such tasks and the load would be balanced more evenly between both types of resources, but processing the critical path (such as task d in the example graph) could become more expensive, because such tasks would be inefficiently executed on CPUs. Intuitively, the geometric mean of these two bounds (1 and m/k) is then the best compromise.

4.2 A Competitive Algorithm that Performs Well in Practice

Although the QA algorithm has the best known competitive ratio, the greedy strategy EFT (see Sect. 2) actually leads to better schedules on most realistic instances because it balances the load among the resources. However, its performance can be \(2+(m-1)/k\) times worse than the optimal solution (see [10] for a proof of this result).

We propose a new tunable algorithm, named MixEFT, which benefits both from the performance of EFT on most instances and from the robustness of QA on the hardest graphs. The idea is to improve EFT by switching to a guaranteed algorithm whenever EFT does not perform well enough. The algorithm is composed of two phases. In the first phase, it behaves exactly as EFT, except that it also simulates the schedule that QA would have produced on the same instance. If the makespan obtained by EFT becomes more than \(\lambda \) times larger than the makespan of the simulated QA (for a fixed positive parameter \(\lambda \)), we switch to the second phase, in which MixEFT behaves as QA. A small \(\lambda \) leads to a smaller competitive ratio, but may degrade the practical performance of MixEFT.
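A minimal sketch of this switching logic; the three single-task scheduling callbacks are hypothetical helpers, not part of the paper:

```python
def mixeft(released_tasks, lam, eft_step, qa_step, simulate_qa_step):
    """First phase: schedule with EFT while simulating QA on the side.
    Switch permanently to QA once EFT's current makespan exceeds lam times
    the makespan of the simulated QA schedule.

    eft_step / qa_step place one task on the real platform and return the
    current makespan; simulate_qa_step does the same on a simulated copy
    of the platform running pure QA. All three are assumed interfaces.
    """
    eft_phase = True
    for task in released_tasks:                 # tasks arrive online
        qa_makespan = simulate_qa_step(task)
        if eft_phase:
            if eft_step(task) > lam * qa_makespan:
                eft_phase = False               # second phase: behave as QA
        else:
            qa_step(task)
```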

The competitive ratio of this algorithm is in \(O(\lambda \sqrt{m/k})\): the first phase cannot lead to a schedule more than \(\lambda \) times worse than QA’s, and the second phase inherits the competitive ratio of QA. Therefore, the algorithm is \((\lambda +1)(2\sqrt{m/k}+1)\)-competitive (see [10] for more details). Note that this competitive ratio is not tight: the worst performance observed so far is \(\max (\lambda ,2\sqrt{m/k}+1)\).

5 Simulations

We now provide simulations to illustrate the performance of both competitive algorithms and simple heuristic strategies on various task graphs.

5.1 Baseline Heuristics

In addition to the four online algorithms discussed above (ER-LS from [2], QA, EFT, and MixEFT, implemented with \(\lambda =2\) unless otherwise specified), we consider two simple strategies that follow the same scheme as QA, with a different allocation criterion: Quickest allocates each task to the resource type on which its computing time is smaller; Ratio allocates a task to the GPUs if and only if its GPU computing time is at least m/k times smaller than its CPU computing time. Intuitively, Quickest should perform well on graphs in which the critical path is preponderant, as it minimizes the execution time of each task. Conversely, Ratio should perform well on graphs with high parallelism throughout the execution, as it executes more tasks concurrently on the CPUs. We also used the offline HEFT algorithm [25], which is known to perform well in practice, as a baseline to compare all online strategies.
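The two allocation criteria, written as predicates in the same illustrative style as the QA sketch of Sect. 4.1 (Step 2 is unchanged):

```python
def quickest_on_cpu(p_cpu, p_gpu, m, k):
    # Quickest: choose the resource type with the smaller processing time.
    return p_cpu <= p_gpu

def ratio_on_cpu(p_cpu, p_gpu, m, k):
    # Ratio: go to a GPU only if the task is accelerated by at least m/k.
    return p_cpu / p_gpu < m / k
```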

5.2 Experimental Setup

We used three types of instances: realistic DAGs corresponding to a linear algebra application, namely the Cholesky factorization, random DAGs used in the literature, and ad hoc instances designed to be difficult for this problem and specifically for QA.

Cholesky factorization is a linear algebra application whose parallel implementations usually rely on a blocked algorithm over a tiled matrix for performance reasons. We consider matrix sizes ranging from \(2\times 2\) tiles to \(15\times 15\) tiles, which leads to DAGs with 4 to 680 tasks. Tasks correspond to four linear algebra kernels: GEMM, SYRK, TRSM, and POTRF. Their respective processing times on a CPU are set to 170 ms, 95 ms, 88 ms, and 33 ms, and on a GPU to 5.95 ms, 3.65 ms, 8.11 ms, and 15.6 ms, which corresponds to measurements [1, 6] made using the Chameleon software [12].
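For reference, these DAGs follow the classical loop nest of a tiled (right-looking) Cholesky factorization; a sketch that only enumerates the tasks, the dependences being implied by tile reuse:

```python
def cholesky_tasks(N):
    """Tasks of a tiled Cholesky factorization of an N x N tile matrix.
    For N = 2 this yields 4 tasks and for N = 15 it yields 680 tasks,
    matching the DAG sizes used in the text."""
    tasks = []
    for j in range(N):
        tasks.append(("POTRF", (j,)))                 # factor diagonal tile
        for i in range(j + 1, N):
            tasks.append(("TRSM", (i, j)))            # solve below the diagonal
        for i in range(j + 1, N):
            tasks.append(("SYRK", (i, j)))            # update diagonal tile i
            for l in range(j + 1, i):
                tasks.append(("GEMM", (i, l, j)))     # update tile (i, l)
    return tasks
```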

The random instances come from the STG set [23], which is often used in the literature to compare the performance of scheduling strategies. We report here the simulations made with 180 graphs of 300 nodes each. We consider that the cost generated by the STG random generator is the processing time of the corresponding task on a GPU. Based on the previous measurements for linear algebra kernels, we assume that the average speedup between CPU and GPU is around 15, with a large variance. Thus, to obtain the processing time of a task on a CPU, we multiply its cost on GPU by a random value with expected value 15 and standard deviation 15. We use a gamma distribution for this purpose because it has been advocated for modeling job runtimes [15], it is positive, and its expected value and standard deviation can be set independently by adjusting its parameters.
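These two constraints fully determine the gamma parameters: with mean and standard deviation both equal to 15, the shape is \((15/15)^2=1\) and the scale is \(15^2/15=15\), so the multiplier actually follows an exponential distribution. A sketch of the corresponding generator, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
MEAN, STD = 15.0, 15.0
SHAPE = (MEAN / STD) ** 2     # = 1 here: the gamma reduces to an exponential
SCALE = STD ** 2 / MEAN       # = 15

def cpu_time(gpu_time):
    """CPU processing time derived from the STG cost (taken as GPU time)."""
    return gpu_time * rng.gamma(SHAPE, SCALE)
```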

Finally, specific random instances have been designed to test the limitations of QA. These ad hoc instances consist of a chain of tasks together with a set of independent tasks, such that all cores are expected to finish simultaneously if a GPU is dedicated to the chain and all independent tasks are load-balanced on the other cores. The expected processing time of each task on a GPU is 1 (with a standard deviation of 0.1). Each instance is parameterized by a number \(\mu \), which represents the expected processing time on a CPU, and varies from \((m/k)^{-1/4}\) to \((m/k)^{5/4}\) (the standard deviation of the CPU processing times is equal to 10% of \(\mu \)). For a given expected CPU cost \(\mu \), the number of tasks in the chain is equal to \(\lceil \frac{n}{m/\mu +k} \rceil \), where \(n=300\) is the total number of tasks. Therefore, the larger \(\mu \), the longer the chain.
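A sketch of this generator under our reading of the description; the normal distribution for the individual processing times is our assumption, as the text only specifies means and standard deviations:

```python
import math, random

def adhoc_instance(m, k, mu, n=300, seed=0):
    """Chain plus independent tasks, sized so that one GPU running the chain
    and the remaining cores sharing the other tasks finish simultaneously.
    Returns (chain, independent) as lists of (cpu_time, gpu_time) pairs."""
    rnd = random.Random(seed)
    chain_len = math.ceil(n / (m / mu + k))          # formula from the text

    def task():
        return (max(0.0, rnd.gauss(mu, 0.1 * mu)),   # CPU time: mean mu
                max(0.0, rnd.gauss(1.0, 0.1)))       # GPU time: mean 1

    chain = [task() for _ in range(chain_len)]
    independent = [task() for _ in range(n - chain_len)]
    return chain, independent
```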

We have performed simulations for various platform sizes, whose results are available in [10]. As expected from the theoretical analysis, the behaviors of the heuristics mainly depend on the value m/k. For the sake of brevity, we only report here the results obtained for \(m=20\) CPUs and \(k=2\) GPUs, as it is representative of the results for relatively large values of m/k. The code and scripts used for the simulations and the data analysis are available online [9].

Fig. 3. Ratios of the makespan over HEFT for EFT, QA, ER-LS, Ratio, Quickest, and MixEFT with \(m=20\) CPUs and \(k=2\) GPUs. Except in Figure (d), MixEFT is not shown because it performs exactly as EFT. In Figure (d), ER-LS, Ratio, and Quickest are discarded.

5.3 Results

Figure 3 depicts the performance of the six online scheduling algorithms. Except when varying its parameter (Fig. 3(d)), MixEFT performs exactly as EFT (and is thus omitted for better readability). On Cholesky DAGs (Fig. 3(a)), EFT (and thus MixEFT) is always the best strategy. The only difference between QA and ER-LS concerns the first tasks (as Step 1a is removed in QA), which explains why their behavior is similar for large graphs. QA, ER-LS, and Ratio all put POTRF tasks on the CPUs, which leads to a performance loss when the graph is small, because its parallelism is limited and the GPUs are often idle. This is acceptable for larger graphs, however, in which many tasks may be executed in parallel on the GPUs. On the contrary, Quickest puts all tasks on the GPUs, which is efficient for small graphs with low parallelism but becomes worse than Ratio for large graphs.

Figure 3(b) shows similar trends on the random graphs from the STG set: EFT (and thus MixEFT) gives the best results, followed by QA and ER-LS.

Figure 3(c) first shows that EFT (and MixEFT) is almost always the best online heuristic for these ad hoc graphs. For extreme values of the expected CPU processing time \(\mu \) (significantly smaller than 1 or larger than m/k), the four other heuristics are all equivalent and perform well. Otherwise, when \(\mu \) is slightly larger than 1, the instance contains many independent tasks, and Quickest is almost m/k times worse than HEFT, because scheduling the independent tasks on the GPUs is not efficient. Symmetrically, when \(\mu \) is slightly smaller than m/k, the instance contains a long critical path, and Ratio shows poor performance because it schedules the critical path on the CPUs. QA and ER-LS take the best of these two strategies; their worst performance, reached when \(\mu \) is close to \(\sqrt{m/k}\), is about \(\sqrt{m/k}\approx 3\) times larger than HEFT.

Figure 3(d) shows that MixEFT behaves like QA when its \(\lambda \) parameter is smaller than 1, and rapidly switches to mimicking EFT when the parameter increases beyond 1. Note that in all studied instances, EFT was never far from HEFT, so there is no practical gain in using MixEFT rather than EFT. The main advantage of MixEFT lies in its competitive ratio, whereas EFT can lead to very large makespans on specific instances.

6 Conclusion

In this paper, we have focused on the problem of scheduling task graphs on hybrid platforms made of two types of processors, such as CPUs and GPUs. We have studied the online case, when only the tasks whose predecessors are all completed are known to the scheduler, and the graph is thus gradually discovered. We proved that no scheduling algorithm can have a competitive ratio smaller than \(\sqrt{m/k}\), and studied how this ratio varies when more knowledge on the graph is given to the scheduler and/or tasks may be migrated between processors. We have proposed a \((2\sqrt{m/k}+1)\)-competitive algorithm as well as a mixed strategy, which is both \(\varTheta (\sqrt{m/k})\)-competitive and performs as well as the best heuristics in practice. This is demonstrated through an extensive set of simulations. Our future work includes taking into account communication times when moving data from/to the GPUs, and coping with inaccurate processing time estimates.