1 Introduction

There is a tremendous growth in medium-scale cluster computing systems at universities and small companies, as HPC systems with large parallel processing capability become broadly available. The execution time of many high-performance computing applications can be significantly reduced when using a large number of processors. Indeed, parallel multicore platforms enable the fast processing of very large-size jobs, thereby rendering the solution of challenging scientific problems more tractable. For instance, fundamental linear algebra kernels, such as dense and sparse system solvers, and eigenvalue computations, account for a significant fraction of CPU time; all these kernels are parametrized by the number of processors, which ScaLAPACK or PETSc users freely determine (Blackford et al. 1997; Balay et al. 2014).

Although application developers seek to exploit parallelism by developing new codes, performance bottlenecks often remain because of legacy software, I/O overheads, inherently serial parts, and communication overheads. However, monopolizing all computing resources to accelerate the processing of a single application is very likely to lead to inefficient resource usage. This is because the typical speed-up profile of most applications is sub-linear and eventually flattens out: when the number of processors increases, the execution time first decreases, but not linearly, because it suffers from the overhead due to communications and load imbalance; at some point, adding more resources does not lead to any significant benefit. Still, because the execution time may keep decreasing marginally with more processors, the typical user enrolls all available resources, thereby wasting them.

In this paper, we consider a pool of several applications that have been submitted for execution. Rather than executing each of them in sequence, with the maximum number of available resources, we introduce co-scheduling algorithms that execute several applications concurrently. We do increase the individual execution time of each application, but (i) we improve the efficiency of the parallelization, because each application is scheduled on fewer resources; (ii) the total execution time will be much shorter; and (iii) the average response time will also be shorter. In other words, co-scheduling increases platform yield (thereby saving energy) without sacrificing response time.

In operating high-performance computing systems, the costs of energy consumption can greatly impact the total costs of ownership. Consequently, there is a move away from a focus on peak performance (or speed) and towards improving energy efficiency (Kamil et al. 2008; Scogland et al. 2011). Recent results on improving the energy efficiency of workloads can be broadly classified into approaches that focus on dynamic voltage and frequency scaling, or alternatively, task aggregation or co-scheduling. In both types of approaches, the individual execution time of an application may increase but there can be considerable energy savings in processing a workload.

More formally, we deal with the following problem: given (i) a distributed-memory platform with p processors, and (ii) n applications, or tasks, \(T_{i}\), with their execution profiles (\(t_{i,j}\) is the execution time of \(T_{i}\) with j processors), what is the best way to co-schedule them, i.e., to partition them into packs, so as to minimize the sum of the execution times over all packs? Here a pack is a subset of tasks, together with a processor assignment for each task. The constraint is that the total number of resources assigned to a pack does not exceed p, and the execution time of a pack is the longest execution time of a task within that pack. The objective of this paper is to study this co-scheduling problem, both theoretically and experimentally. We aim to demonstrate the gain that can be achieved through co-scheduling, both on platform yield and response time, using a set of real-life application profiles.

On the theoretical side, the complexity of the co-scheduling problem has never been investigated, except for the simple case where each pack comprises at most \(k=2\) tasks (Shantharam et al. 2013). While the problem has polynomial complexity for at most \(k=2\) tasks per pack (Shantharam et al. 2013), we show that it is NP-complete as soon as packs may contain \(k \ge 3\) tasks. Note that the instance with \(k=p\) is the general, unconstrained, instance of the co-scheduling problem. We also propose an approximation algorithm for the general instance. In addition, we propose an optimal processor assignment procedure when the tasks that form a pack are given. We use these two results to derive efficient heuristics. Finally, we discuss how to optimally solve small-size instances, either through enumerating partitions into packs, or through an integer linear program: this has a potentially exponential cost, but allows us to assess the absolute quality of the heuristics that we have designed. Altogether, these results lay solid theoretical foundations for the problem.

On the experimental side, we study the performance of the heuristics on a variety of workloads, whose application execution times model profiles of parallel scientific codes. We focus on three criteria: (i) cost of the co-schedule, i.e., total execution time; (ii) packing ratio, which evaluates the idle time of processors during execution; and (iii) response time, compared to a fully parallel execution of the tasks in shortest-task-first order. The proposed heuristics show very good performance within a short running time, hence validating the approach.

The paper is organized as follows. We discuss related work in Sect. 2. The problem is then formally defined in Sect. 3. Theoretical results are presented in Sect. 4, exhibiting the problem complexity, discussing sub-problems and optimal solutions, and providing an approximation algorithm. Building upon these results, several polynomial-time heuristics are described in Sect. 5, and they are thoroughly evaluated in Sect. 6. Finally we conclude and discuss future work in Sect. 7.

2 Related work

In this paper, we deal with pack scheduling for parallel tasks, aiming at makespan minimization (recall that the makespan is the total execution time). The corresponding problem with sequential tasks (tasks that execute on a single processor) is easy to solve for the makespan minimization objective: simply make a pack out of the largest p tasks, and proceed likewise while there remain tasks. Note that the pack scheduling problem with sequential tasks has been widely studied for other objective functions, see Brucker et al. (1998) for various job cost functions, and Potts and Kovalyov (2000) for a survey. Returning to the problem with sequential tasks and the makespan objective, Koole and Righter (2001) deal with the case where the execution time of each task is unknown but defined by a probabilistic distribution. They improve the result of Deb and Serfozo (1973), who considered the stochastic problem with identical jobs. Ikura and Gimple (1986) solve the makespan minimization problem where tasks have identical execution times, but different release times and deadlines; they assume agreeable deadlines, meaning that if a task has an earlier release time than another, it also has an earlier deadline. Koehler and Khuller (2013) propose a linear-time solution to this last problem, and further give an \(O(n^3)\) solution to the problem of minimizing the number of packs while achieving optimal makespan.

To the best of our knowledge, the problem with parallel tasks has not been studied as such. However, it was introduced in Dutot et al. (2003) as a moldable-by-phase model to approximate the moldable problem. The moldable task model is similar to the pack-scheduling model, but without the additional constraint (pack constraint) that the execution of new tasks cannot start before all tasks in the current pack are completed. Dutot et al. (2003) provide an optimal polynomial-time solution for the problem of pack scheduling identical independent tasks, using a dynamic-programming algorithm. This is the only instance of pack scheduling with parallel tasks that we found in the literature.

A closely related problem is the rectangle packing problem, or 2D-Strip packing. Given a set of rectangles of different sizes, the problem consists of packing these rectangles into another rectangle of size \(p\times m\). If one sees one dimension (p) as the number of processors, and the other dimension (m) as the maximum makespan allowed, this problem is identical to the variant of our problem where the number of processors is pre-assigned to each task: each rectangle \(r_i\) of size \(p_i \times m_i\) that has to be packed can be seen as the task \(T_i\) to be computed on \(p_i\) processors, with \(t_{i,p_i} = m_i\). Turek et al. (1994) approximated the rectangle packing problem using shelf-based solutions: the rectangles are assigned to shelves, whose placements correspond to constant time values. All rectangles assigned to a shelf have equal starting times, and the next shelf is placed on top of the previous shelf. This is exactly what we ask in our pack-scheduling model. This problem is also called level packing in some papers, and we refer the reader to a recent survey on 2D-packing algorithms in Lodi et al. (2002). In particular, Coffman et al. (1980) show that level-packing algorithms can achieve a 2.7-approximation for the 2D-Strip packing problem (1.7 when the length of each rectangle is bounded by 1). Unfortunately, all these algorithms consider the number of processors (or width of the rectangles) to be already fixed for each task, hence they cannot be used directly in our problem for which a key decision is to decide the number of processors assigned to each task.

In practice, pack scheduling is really useful as shown by recent results. Li et al. (2010) propose a framework to predict the energy and performance impacts of power-aware MPI task aggregation. Frachtenberg et al. (2005) show that system utilization can be improved through their schemes to co-schedule jobs based on their load-balancing requirements and inter-processor communication patterns. Shantharam et al. (2013) study co-scheduling based on speed-up profiles, similar to our work, but packs can have only one or two tasks; still, they report faster workload completion and corresponding savings in system energy.

Several recent publications (Bhadauria and McKee 2010; Chandra et al. 2005; Hankendi and Coskun 2012) consider co-scheduling at a single multicore node, when contention for resources by co-scheduled tasks leads to complex tradeoffs between energy and performance measures. Chandra et al. (2005) predict and utilize inter-thread cache contention at a multicore in order to improve performance. Hankendi and Coskun (2012) show that there can be measurable gains in energy per unit of work through the application of their multi-level co-scheduling technique at runtime, which is based on classifying tasks according to specific performance measures. Bhadauria and McKee (2010) consider local search heuristics to co-schedule tasks in a resource-aware manner at a multicore node to achieve significant gains in thread throughput per watt.

These publications demonstrate that complex tradeoffs cannot be captured through the use of the speed-up measure alone, without significant additional measurements to capture performance variations from cross-application interference at a multicore node. Additionally, and following Shantharam et al. (2013) where packs have one or two tasks only, we expect significant benefits even when we aggregate only across multicore nodes because speed-ups suffer due to the longer latencies of data transfer across nodes. We can therefore project savings in energy as being commensurate with the savings in the time to complete a workload through co-scheduling. Hence, we only test configurations where no more than a single application can be scheduled on a multicore node.

One could ask, given a set of n tasks to schedule, why schedule them in packs rather than globally? A global schedule would avoid the gaps incurred by some processors between the end of a pack and the beginning of the next pack, thereby potentially decreasing the makespan. However, there are several reasons to prefer pack scheduling. First, a global schedule is very hard to construct. Best-known heuristics greedily assign a new task to a set of processors as soon as this set terminates execution, thereby constraining the number of resources to be the same for the new task as for the last task. Our co-schedule does not suffer from this rigidity in processor assignment decisions. Secondly, the cost of scheduling itself is greatly reduced with pack scheduling. The scheduler launches a set of tasks and transfers corresponding input data only at the beginning of a pack. No overhead is paid until all tasks in the pack return, and a new pack is executed.

The same argument applies to explain why we target parallel (also called moldable) tasks rather than malleable tasks (Drozdowski 2003; Dutot et al. 2003). With malleable tasks, one can change the number of resources on the fly, during execution. On the contrary with parallel tasks, the number of resources is freely chosen before execution, but cannot be modified once the task is started. In practice, malleable tasks incur even more scheduling overhead, because many resource re-assignments must be accounted for during the execution of each task. In addition, it is not clear how to bound the number of such re-assignments, which is needed to get a polynomial-time schedule. Also, the speed-up profile of a malleable task may well not be the same throughout execution. To give a simple example, consider an application whose task graph is a succession of fork-join graphs with decreasing degrees of parallelism (decreasing number of tasks in each fork-join subgraph). The speed-up that can be achieved decreases as the application progresses, because there is more and more overhead when fewer parallel tasks are executed, hence the speed-up profile of the application must be re-evaluated each time a resource re-assignment is performed.

Altogether, pack scheduling with parallel (or moldable) tasks achieves a good trade-off between simplicity and feasibility on one side, and practical usefulness on the other side. It incurs minimal scheduling overhead while providing both a good platform throughput and task response time, as shown experimentally in Sect. 6.

Finally, we point out that co-scheduling with packs can be seen as the static counterpart of batch scheduling techniques, where jobs are dynamically partitioned into batches as they are submitted to the system (see Muthuvelu et al. (2011) and the references therein). Batch scheduling is a complex online problem, where jobs have release times and deadlines, and when only partial information on the whole workload is known when taking scheduling decisions. On the contrary, co-scheduling applies to a set of tasks that are all ready for execution. In this paper, we use pack instead of batch to avoid any confusion.

3 Problem definition

The application consists of n independent tasks \(T_1,\dots ,T_n\). The target execution platform consists of p identical processors, and each task \(T_i\) can be assigned an arbitrary number \(\sigma (i)\) of processors, where \(1 \le \sigma (i) \le p\). The objective is to minimize the total execution time by co-scheduling several tasks onto the p resources. Note that the approach is agnostic of the granularity of each processor, which can be either a single CPU or a multicore node.

Speed-up profiles Let \(t_{i,j}\) be the execution time of task \(T_{i}\) with j processors, and \(work (i,j) = j \times t_{i,j}\) be the corresponding work. We assume the following for \(1\le i \le n\) and \(1\le j < p\):

$$\begin{aligned}&\text {Weakly decreasing execution time: } t_{i,j+1} \le t_{i,j} \end{aligned}$$
(1)
$$\begin{aligned}&\text {Weakly increasing work: } work (i,j+1) \ge work (i,j) \end{aligned}$$
(2)

Equation (1) implies that execution time is a non-increasing function of the number of processors. Equation (2) states that efficiency decreases with the number of enrolled processors: in other words, parallelization has a cost! As a side note, we observe that these requirements make good sense in practice: many scientific tasks \(T_{i}\) are such that \(t_{i,j}\) first decreases (due to load-balancing) and then increases (due to communication overhead), reaching a minimum for \(j=j_{0}\); we can always let \(t_{i,j} = t_{i,j_{0}}\) for \(j \ge j_{0}\) by never actually using more than \(j_{0}\) processors for \(T_{i}\).
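For concreteness, here is a minimal Python sketch of this plateau trick, under the convention (used in all sketches below) that, for a given task, t[j-1] stores \(t_{i,j}\); note that Eq. (2) must still be verified separately on the measured profile:

```python
def plateau(t):
    """Enforce Eq. (1) on a measured profile: t[j-1] is the time on j
    processors. Past the minimum at j0, keep t_{i,j} = t_{i,j0}, i.e.,
    never actually enroll more than j0 processors for this task."""
    out, best = [], float("inf")
    for time in t:
        best = min(best, time)
        out.append(best)
    return out

# example: the measured time decreases up to j0 = 3, then increases
print(plateau([10.0, 6.0, 4.5, 5.0, 5.8]))  # -> [10.0, 6.0, 4.5, 4.5, 4.5]
```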

Remarks Determining \(j_{0}\) for a given application is a challenge by itself. In most cases, it is obtained by profiling and interpolation. Also, in the case of imperfect knowledge of execution time profiles, it is possible to use curve-fitting techniques to construct “near complete” knowledge, and then use this constructed knowledge. We treat the same application with two different problem sizes as two different applications (their execution time profiles could potentially be different). Thus, the sensitivity of the runtime to different parameters that could change runtime profiles is inherently taken care of.

Co-schedules A co-schedule partitions the n tasks into groups (called packs), so that (i) all tasks from a given pack start their execution at the same time; and (ii) two tasks from different packs have disjoint execution intervals. For instance, in the example of Fig. 1, the first two packs have three tasks, the third pack has only one task, and the last pack has two tasks. The execution time, or cost, of a pack is the maximal execution time of a task within that pack, and the cost of a co-schedule is the sum of the costs of all packs.

k-in-p-CoSchedule optimization problem Given a fixed constant \(k\le p\), find a co-schedule with at most k tasks per pack that minimizes the execution time. The most general problem is when \(k=p\), but in some frameworks we may have an upper bound \(k<p\) on the maximum number of tasks within each pack.

Fig. 1 A co-schedule with four packs \(P_{1}\)–\(P_{4}\)

4 Theoretical analysis

First we discuss the complexity of the problem in Sect. 4.1, by exhibiting polynomial and NP-complete instances. Next we discuss how to optimally schedule a set of k tasks in a single pack (Sect. 4.2). Then we explain how to compute the optimal solution, at a potentially exponential cost, in Sect. 4.3. Finally, we provide an approximation algorithm in Sect. 4.4.

4.1 Complexity

Theorem 1

The 1-in- p-CoSchedule and 2-in- p-CoSchedule problems can both be solved in polynomial time.

Proof

This result is obvious for 1-in- p-CoSchedule: each task is assigned exactly p processors (see Eq. (1)) and the minimum execution time is \(\sum _{i=1}^n t_{i,p}\).

The proof is more involved for 2-in- p-CoSchedule, and we start with the 2-in-2-CoSchedule problem to get an intuition. Consider the weighted undirected graph \(G=(V,E)\), where \(|V|=n\), each vertex \(v_i \in V\) corresponding to a task \(T_i\). The edge set E is the following: (i) for all i, there is a loop on \(v_i\) of weight \(t_{i,2}\); (ii) for all \(i<i{^\prime }\), there is an edge between \(v_i\) and \(v_{i{^\prime }}\) of weight \(\max (t_{i,1},t_{i{^\prime },1})\). Finding a perfect matching of minimal weight in G leads to the optimal solution to 2-in-2-CoSchedule, which can thus be solved in polynomial time.

For the 2-in- p-CoSchedule problem, the proof is similar, the only difference lies in the construction of the edge set E: (i) for all i, there is a loop on \(v_i\) of weight \(t_{i,p}\); (ii) for all \(i<i{^\prime }\), there is an edge between \(v_i\) and \(v_{i{^\prime }}\) of weight \(\min _{j=1,\ldots ,p}\left( \max (t_{i,p-j},t_{i{^\prime },j}) \right) \). Again, a perfect matching of minimal weight in G gives the optimal solution to 2-in- p-CoSchedule. We conclude that the 2-in- p-CoSchedule problem can be solved in polynomial time. \(\square \)
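To make the construction concrete, the following sketch computes the optimal 2-in-p-CoSchedule cost with NetworkX, assuming its min_weight_matching routine (which returns a minimum-weight matching of maximum cardinality). Since matching algorithms do not handle loops, each task receives a hypothetical twin node: matching a task with its twin encodes a singleton pack, and zero-weight twin–twin edges keep a perfect matching feasible.

```python
import networkx as nx

def two_in_p_cost(t, p):
    """Optimal 2-in-p co-schedule cost via minimum-weight perfect matching.
    t[i][j-1] = execution time of task i on j processors (p >= 2)."""
    n = len(t)
    G = nx.Graph()
    for i in range(n):
        # matching task i with its twin = task i alone in a pack, on p procs
        G.add_edge(("task", i), ("twin", i), weight=t[i][p - 1])
        for i2 in range(i + 1, n):
            # best split of the p processors between tasks i and i2
            w = min(max(t[i][p - j - 1], t[i2][j - 1]) for j in range(1, p))
            G.add_edge(("task", i), ("task", i2), weight=w)
            # twins of paired tasks absorb each other at zero cost
            G.add_edge(("twin", i), ("twin", i2), weight=0)
    matching = nx.min_weight_matching(G)
    return sum(G.edges[e]["weight"] for e in matching)
```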

Theorem 2

When \(k\ge 3\), the k-in- p-CoSchedule problem is strongly NP-complete.

Proof

We prove the NP-completeness of the decision problem associated to k-in- p-CoSchedule: given n independent tasks, p processors, a set of execution times \(t_{i,j}\) for \(1\le i \le n\) and \(1\le j \le p\) satisfying Eqs. (1) and (2), a fixed constant \(k\le p\) and a deadline D, can we find a co-schedule with at most k tasks per pack, and whose execution time does not exceed D? The problem is obviously in NP: if we have the composition of every pack, and for each task in a pack, the number of processors onto which it is assigned, we can verify in polynomial time: (i) that it is indeed a pack schedule; (ii) that the execution time is smaller than D.

We first prove the strong completeness of 3-in- p-CoSchedule. We use a reduction from 3-Partition. Consider an arbitrary instance \(\mathcal {I} _1\) of 3-Partition: given an integer B and 3n integers \(a_1,\ldots ,a_{3n}\), can we partition the 3n integers into n triplets, each of sum B? We can assume that \(\sum _{i=1}^{3n} a_{i} = nB\), otherwise \(\mathcal {I} _1\) has no solution. The 3-Partition problem is NP-hard in the strong sense (Garey and Johnson 1979), which implies that we can encode all integers (\(a_1\), ..., \(a_{3n}\), B) in unary. We build the following instance \(\mathcal {I} _2\) of 3-in- p-CoSchedule: the number of processors is \(p=B\), the deadline is \(D=n\), there are 3n tasks \(T_i\), with the following execution times: for all \(i,j\), if \(j<a_i\) then \(t_{i,j} =1 + \frac{1}{a_i}\), otherwise \(t_{i,j} = 1\). It is easy to check that Eqs. (1) and (2) are both satisfied. For the latter, since there are only two possible execution times for each task, we only need to check Eq. (2) for \(j=a_{i}-1\), and we do obtain that \((a_{i}-1)(1 + \frac{1}{a_i}) \le a_{i}\). Finally, \(\mathcal {I} _2\) has a size polynomial in the size of \(\mathcal {I} _1\), even if we write all instance parameters in unary: the execution time is n, and the \(t_{i,j}\)’s have the same size as the \(a_i\)’s.

We now prove that \(\mathcal {I} _1\) has a solution if and only if \(\mathcal {I} _2\) does. Assume first that \(\mathcal {I} _1\) has a solution. For each triplet \((a_i,a_j,a_k)\) of \(\mathcal {I} _1\), we create a pack with the three tasks \((T_i, T_j, T_k)\) where \(T_i\) is scheduled on \(a_i\) processors, \(T_j\) on \(a_j\) processors, and \(T_k\) on \(a_k\) processors. By definition, \(a_i+a_j+a_k = B\), and the execution time of this pack is 1. We do this for the n triplets, leading to a valid co-schedule with total execution time n. Hence the solution to \(\mathcal {I} _2\).

Assume now that \(\mathcal {I} _2\) has a solution. The minimum execution time for any pack is 1 (since it is the minimum execution time of any task and a pack cannot be empty). Hence, the solution cannot have more than n packs. Because there are 3n tasks and the number of elements in a pack is limited to three, there are exactly n packs, each of exactly 3 elements, and furthermore all these packs have an execution time of 1 (otherwise the deadline n is not matched). If there were a pack \((T_i,T_j,T_k)\) such that \(a_i+a_j+a_k>B\), then one of the three tasks, say \(T_{i}\), would have to use fewer than \(a_{i}\) processors, hence would have an execution time greater than 1. Therefore, for each pack \((T_i,T_j,T_k)\), we have \(a_i+a_j+a_k\le B\). The fact that this inequality is an equality for all packs follows from the fact that \(\sum _{i=1}^{3n} a_{i} = nB\). Finally, we conclude by saying that the set of triplets \((a_i,a_j,a_k)\) for every pack \((T_i,T_j,T_k)\) is a solution to \(\mathcal {I} _1\).

The final step is to prove the completeness of k-in- p-CoSchedule for a given \(k \ge 4\). We perform a similar reduction from the same instance \(\mathcal {I} _{1}\) of 3-Partition. We construct the instance \(\mathcal {I} _2\) of k-in- p-CoSchedule where the number of processors is \(p=B +(k-3)(B+1)\) and the deadline is \(D=n\). There are 3n tasks \(T_i\) with the same execution times as before (for \(1\le i\le 3n\), if \(j<a_i\) then \(t_{i,j} =1 + \frac{1}{a_i}\), otherwise \(t_{i,j} = 1\)), and also \(n(k-3)\) new identical tasks such that, for \(3n+1\le i \le kn\), \(t_{i,j}=\max \left( \frac{B+1}{j}, 1 \right) \). It is easy to check that Eqs. (1) and (2) are also fulfilled for the new tasks. If \(\mathcal {I} _1\) has a solution, we construct the solution to \(\mathcal {I} _2\) similarly to the previous reduction, and we add to each pack \(k-3\) tasks \(T_i\) with \(3n+1 \le i \le kn\), each assigned to \(B+1\) processors. This solution has an execution time exactly equal to n. Conversely, if \(\mathcal {I} _2\) has a solution, we can verify that there are exactly n packs (there are kn tasks and each pack has an execution time at least equal to 1). Then we can verify that there are at most \((k-3)\) tasks \(T_i\) with \(3n+1 \le i \le kn\) per pack, since there are exactly \((k-3)(B+1) + B\) processors. Otherwise, if there were \(k-2\) (or more) such tasks in a pack, then one of them would be scheduled on fewer than \(B+1\) processors, and the execution time of the pack would be greater than 1. Finally, we can see that in \(\mathcal {I} _2\), each pack is composed of \((k-3)\) tasks \(T_i\) with \(3n+1 \le i \le kn\), scheduled on \((k-3)(B+1)\) processors at least, and that there remain triplets of tasks \(T_i\), with \(1\le i \le 3n\), scheduled on at most B processors. The end of the proof is identical to the reduction in the case \(k=3\). \(\square \)

Note that the 3-in- p-CoSchedule problem is NP-complete, and the 2-in- p-CoSchedule problem can be solved in polynomial time, hence 3-in-3-CoSchedule is the simplest problem whose complexity remains open.

4.2 Scheduling a pack of tasks

In this section, we discuss how to optimally schedule a set of k tasks in a single pack: the k tasks \(T_1,\ldots ,T_k\) are given, and we search for an assignment function \(\sigma : \{1,\ldots ,k\} \rightarrow \{1,\ldots ,p\}\) such that \(\sum _{i=1}^k \sigma (i) \le p\), where \(\sigma (i)\) is the number of processors assigned to task \(T_i\). Such a schedule is called a 1-pack schedule, and its cost is \(\max _{1\le i\le k} t_{i,\sigma (i)}\). In Algorithm 1 below, we use the notation \(T_{i} \preccurlyeq _{\sigma } T_{j}\) if \(t_{i,\sigma (i)} \le t_{j,\sigma (j)}\):

Algorithm 1 (pseudocode): optimal processor assignment within a single pack; the greedy procedure is described below.
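As a concrete illustration, here is a minimal Python sketch of this greedy procedure (one processor per task initially, then the task with the longest current execution time repeatedly receives an extra processor), using a binary heap for the sorted list L:

```python
import heapq

def one_pack_schedule(t, p):
    """Algorithm 1 sketch: optimal processor assignment within one pack.
    t[i][j-1] = execution time of task i on j processors (1 <= j <= p).
    Returns (sigma, cost) with sigma[i] = processors given to task i."""
    k = len(t)
    assert k <= p, "a pack cannot contain more tasks than processors"
    sigma = [1] * k
    heap = [(-t[i][0], i) for i in range(k)]   # max-heap on execution time
    heapq.heapify(heap)
    for _ in range(p - k):                     # hand out the p-k spare processors
        _, i = heapq.heappop(heap)             # task with longest execution time
        sigma[i] += 1
        heapq.heappush(heap, (-t[i][sigma[i] - 1], i))
    return sigma, -heap[0][0]                  # pack cost = longest time
```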

Theorem 3

Given k tasks to be scheduled on p processors in a single pack, Algorithm 1 finds a 1-pack schedule of minimum cost in time \(O(p \log (k))\).

In this greedy algorithm, we first assign one processor to each task, and while there are processors that are not processing any task, we select the task with the longest execution time and assign an extra processor to this task. Algorithm 1 performs \(p-k\) iterations to assign the extra processors. We denote by \(\sigma ^{(\ell )}\) the current value of the function \(\sigma \) at the end of iteration \(\ell \). For convenience, we let \(t_{i,0}=+\infty \) for \(1\le i \le k\). We start with the following lemma:

Lemma At the end of iteration \(\ell \) of Algorithm 1, let \(T_{i^{\star }}\) be the first task of the sorted list, i.e., the task with longest execution time. Then, for all i, \(t_{i^{\star },\sigma ^{(\ell )}(i^{\star })} \le t_{i,\sigma ^{(\ell )}(i)-1}\).

Proof

Let \(T_{i^{\star }}\) be the task with longest execution time at the end of iteration \(\ell \). For tasks such that \(\sigma ^{(\ell )}(i)=1\), the result is obvious since \(t_{i,0}=+\infty \). Let us consider any task \(T_i\) such that \(\sigma ^{(\ell )}(i)>1\). Let \(\ell {^\prime }+1\) be the last iteration when a new processor was assigned to task \(T_i\): \(\sigma ^{(\ell {^\prime })}(i)=\sigma ^{(\ell )}(i)-1\) and \(\ell {^\prime }<\ell \). By definition of iteration \(\ell {^\prime }+1\), task \(T_i\) was chosen because \(t_{i,\sigma ^{(\ell {^\prime })}(i)}\) was the longest current execution time; in particular, \(t_{i,\sigma ^{(\ell {^\prime })}(i)} \ge t_{i^{\star },\sigma ^{(\ell {^\prime })}(i^{\star })}\). Also, since we never remove processors from tasks, we have \(\sigma ^{(\ell {^\prime })}(i) \le \sigma ^{(\ell )}(i)\) and \(\sigma ^{(\ell {^\prime })}(i^{\star }) \le \sigma ^{(\ell )}(i^{\star })\). Finally, \(t_{i^\star ,\sigma ^{(\ell )}(i^{\star })}\le t_{i^\star ,\sigma ^{(\ell {^\prime })}(i^{\star })} \le t_{i,\sigma ^{(\ell {^\prime })}(i)} = t_{i,\sigma ^{(\ell )}(i)-1}\). \(\square \)

(Proof of Theorem 3)

Let \(\sigma \) be the 1-pack schedule returned by Algorithm 1 of cost \(c(\sigma )\), and let \(T_{i^{\star }}\) be a task such that \(c(\sigma )=t_{i^\star ,\sigma (i^\star )}\). Let \(\sigma {^\prime }\) be a 1-pack schedule of cost \(c(\sigma {^\prime })\). We prove below that \(c(\sigma {^\prime }) \ge c(\sigma )\), hence \(\sigma \) is a 1-pack schedule of minimum cost:

  • If \(\sigma {^\prime }(i^{\star }) \le \sigma (i^{\star })\), then \(T_{i^{\star }}\) has fewer processors in \(\sigma {^\prime }\) than in \(\sigma \), hence its execution time is larger, and \(c(\sigma {^\prime }) \ge c(\sigma )\).

  • If \(\sigma {^\prime }(i^{\star }) > \sigma (i^{\star })\), then there exists i such that \(\sigma {^\prime }(i) < \sigma (i)\) (since the total number of processors is p in both \(\sigma \) and \(\sigma {^\prime }\)). We can apply the previous Lemma at the end of the last iteration, where \(T_{i^{\star }}\) is the task of maximum execution time: \(t_{i^{\star },\sigma (i^{\star })} \le t_{i,\sigma (i)-1} \le t_{i,\sigma {^\prime }(i)}\), and therefore \(c(\sigma {^\prime }) \ge c(\sigma )\).

Finally, the time complexity is obtained as follows: first we sort k elements, in time \(O(k \log k)\). Then there are \(p-k\) iterations, and at each iteration, we insert an element in a sorted list of \(k-1\) elements, which takes \(O(\log k)\) operations (use a heap for the data structure of L). \(\square \)

Note that it is easy to compute an optimal 1-pack schedule using a dynamic-programming algorithm: the optimal cost is \(c(k,p)\), which we compute using the recurrence formula

$$\begin{aligned} c(i,q)=\min _{1\le q{^\prime } \le q} \{ \max (c(i-1,q-q{^\prime }),t_{i,q{^\prime }}) \} \end{aligned}$$

for \(2\le i\le k\) and \(1\le q\le p\), initialized by \(c(1,q)=t_{1,q}\), and \(c(i,0)=+\infty \). The complexity of this algorithm is \(O(kp^2)\). However, Algorithm 1 computes the same optimal cost with a significantly lower complexity.
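For reference, a direct sketch transcription of this recurrence, with the same t[i][j-1] convention as above:

```python
import math

def one_pack_dp(t, p):
    """Dynamic-programming alternative to Algorithm 1, in O(k p^2):
    c[i][q] = optimal cost for tasks T_1..T_i on at most q processors."""
    k = len(t)
    c = [[math.inf] * (p + 1) for _ in range(k + 1)]
    for q in range(1, p + 1):
        c[1][q] = t[0][q - 1]                  # c(1, q) = t_{1,q}
    for i in range(2, k + 1):
        for q in range(1, p + 1):              # c(i, 0) stays +infinity
            c[i][q] = min(max(c[i - 1][q - qp], t[i - 1][qp - 1])
                          for qp in range(1, q + 1))
    return c[k][p]
```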

4.3 Computing the optimal solution

In this section, we sketch two methods to find the optimal solution to the general k-in- p-CoSchedule problem. This can be useful to solve some small-size instances, albeit at the price of a cost exponential in the number of tasks n.

The first method is to generate all possible partitions of the tasks into packs. This amounts to computing all partitions of n elements into subsets of cardinality at most k. For a given partition of tasks into packs, we use Algorithm 1 to find the optimal processor assignment for each pack, which gives the cost of that partition. The optimal solution is then the minimum of these costs over all partitions.
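A sketch of this exhaustive search is given below; it reuses the one_pack_schedule sketch above, and its cost grows exponentially with n, as expected:

```python
from itertools import combinations

def partitions(tasks, k):
    """Yield every partition of `tasks` into packs of cardinality <= k."""
    if not tasks:
        yield []
        return
    first, rest = tasks[0], tasks[1:]
    for size in range(min(k - 1, len(rest)) + 1):
        for extra in combinations(rest, size):       # companions of `first`
            remaining = [x for x in rest if x not in extra]
            for sub in partitions(remaining, k):
                yield [(first,) + extra] + sub

def optimal_cost(t, p, k):
    """Exhaustive search: each pack is scheduled optimally with the
    one_pack_schedule sketch above (k <= p assumed)."""
    return min(sum(one_pack_schedule([t[i] for i in pack], p)[1]
                   for pack in parts)
               for parts in partitions(list(range(len(t))), k))
```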

The second method is to cast the problem in terms of an integer linear program:

Theorem 4

The following integer linear program characterizes the k-in- p-CoSchedule problem, where the unknown variables are the \(x_{i,j,b}\)’s (Boolean variables) and the \(y_b\)’s (rational variables), for \(1 \le i,b \le n\) and \(1 \le j \le p\):

$$\begin{aligned} \begin{array}{ll} \text {Minimize } \sum _{b=1}^{n} y_b &\text {subject to}\\ \text {(i) } \sum _{j,b} x_{i,j,b} = 1, & 1\le i \le n\\ \text {(ii) } \sum _{i,j} x_{i,j,b} \le k, & 1\le b \le n\\ \text {(iii) } \sum _{i,j} j \times x_{i,j,b} \le p, & 1\le b \le n\\ \text {(iv) } x_{i,j,b} \times t_{i,j} \le y_b, & 1\le i,b \le n,\ 1 \le j \le p\\ \end{array} \end{aligned}$$
(3)

Proof

The \(x_{i,j,b}\)’s are such that \(x_{i,j,b}=1\) if and only if task \(T_i\) is in the pack  b and it is executed on j processors; \(y_b\) is the execution time of pack  b. Since there are no more than n packs (one task per pack), \(b\le n\). The sum \(\sum _{b=1}^{n} y_b\) is therefore the total execution time (\(y_b=0\) if there are no tasks in pack  b). Constraint (i) states that each task is assigned to exactly one pack  b, and with one number of processors j. Constraint (ii) ensures that there are no more than k tasks in a pack. Constraint (iii) adds up the number of processors in pack  b, which should not exceed p. Finally, constraint (iv) computes the cost of each pack. \(\square \)
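As an illustration, program (3) can be written almost verbatim with an off-the-shelf modeling layer; the sketch below assumes PuLP and a bundled MILP solver such as CBC:

```python
import pulp

def coschedule_ilp(t, p, k):
    """Integer linear program (3) written with PuLP (sketch)."""
    n = len(t)
    prob = pulp.LpProblem("coschedule", pulp.LpMinimize)
    x = {(i, j, b): pulp.LpVariable(f"x_{i}_{j}_{b}", cat="Binary")
         for i in range(n) for j in range(1, p + 1) for b in range(n)}
    y = [pulp.LpVariable(f"y_{b}", lowBound=0) for b in range(n)]
    prob += pulp.lpSum(y)                        # objective: sum of pack costs
    for i in range(n):                           # (i) each task in one pack
        prob += pulp.lpSum(x[i, j, b] for j in range(1, p + 1)
                           for b in range(n)) == 1
    for b in range(n):
        prob += pulp.lpSum(x[i, j, b] for i in range(n)
                           for j in range(1, p + 1)) <= k      # (ii)
        prob += pulp.lpSum(j * x[i, j, b] for i in range(n)
                           for j in range(1, p + 1)) <= p      # (iii)
        for i in range(n):
            for j in range(1, p + 1):
                prob += t[i][j - 1] * x[i, j, b] <= y[b]       # (iv)
    prob.solve()
    return pulp.value(prob.objective)
```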

4.4 Approximation algorithm

In this section, we introduce pack-Approx, a 3-approximation algorithm for the p-in-p-CoSchedule problem: if \(\text {COST} _{\textsc {opt}}\) is the optimal solution, and \(\text {COST} _{\mathrm{algo}}\) is the output of the algorithm, we guarantee that \(\text {COST} _{\mathrm{algo}}\le 3\text {COST} _{\textsc {opt}}\). The design principle of pack-Approx is the following: we start from the assignment where each task is executed on one processor, and use Algorithm 2 to build a first solution. Algorithm 2 is a greedy heuristic that builds a co-schedule when each task is pre-assigned a number of processors for execution. Then we iteratively refine the solution, adding a processor to the task with longest execution time, and re-executing Algorithm 2. Here are details on both algorithms:

Algorithm 2 The k-in- p-CoSchedule problem with processor pre-assignments remains strongly NP-complete (use a reduction similar to the one in the proof of Theorem 2). We propose a greedy procedure in Algorithm 2 that is similar to the First Fit Decreasing Height algorithm for strip packing (Coffman et al. 1980). The output is a co-schedule with at most k tasks per pack, and the complexity is \(O(n \log (n))\) (dominated by sorting).
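A minimal sketch of this first-fit-decreasing procedure (the naive \(O(n^2)\) variant rather than the \(O(n\log n)\) one):

```python
def make_pack(n, p, k, sigma, t):
    """Algorithm 2 sketch: tasks (pre-assigned sigma[i] processors) are
    taken by non-increasing execution time and placed in the first pack
    that holds fewer than k tasks and has enough free processors."""
    order = sorted(range(n), key=lambda i: t[i][sigma[i] - 1], reverse=True)
    packs, free = [], []          # free[b] = processors left in pack b
    for i in order:
        for b, pack in enumerate(packs):
            if sigma[i] <= free[b] and len(pack) < k:
                pack.append(i)
                free[b] -= sigma[i]
                break
        else:                     # no existing pack fits: open a new one
            packs.append([i])
            free.append(p - sigma[i])
    return packs
```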

Algorithm 3 We iterate the calls to Algorithm 2, adding a processor to the task with longest execution time, until (i) either the task of longest execution time is already assigned p processors or (ii) the sum of the work of all tasks is greater than p times the longest execution time. The algorithm returns the minimum cost found during execution. The complexity of this algorithm is \(O(n^2p)\) in the simplest version presented here: in the O(np) calls to Algorithm 2, we do not need to re-sort the list but we maintain it sorted instead, thus each call except the first one has linear cost. The complexity can be reduced to \(O(n\log (n) + np)\) using standard algorithmic techniques (Cormen et al. 2009).
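The corresponding sketch of the main loop, reusing make_pack above:

```python
import math

def pack_approx(t, p):
    """Algorithm 3 sketch (k = p case): grow the allocation of the longest
    task, re-pack with make_pack, and return the best cost encountered."""
    n = len(t)
    sigma = [1] * n
    best = math.inf
    while True:
        packs = make_pack(n, p, p, sigma, t)
        cost = sum(max(t[i][sigma[i] - 1] for i in pack) for pack in packs)
        best = min(best, cost)
        star = max(range(n), key=lambda i: t[i][sigma[i] - 1])
        t_max = t[star][sigma[star] - 1]
        a_tot = sum(sigma[i] * t[i][sigma[i] - 1] for i in range(n))
        if sigma[star] == p or a_tot / p > t_max:    # the two exit tests
            return best
        sigma[star] += 1
```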

Algorithm 2 (Make-pack) and Algorithm 3 (pack-Approx): pseudocode, as described above.

Theorem 5

pack-Approx is a 3-approximation algorithm for the p-in-p-CoSchedule problem.

Proof

We start with some notations:

  • step i denotes the ith iteration of the main loop of Algorithm pack-Approx;

  • \(\sigma ^{(i)}\) is the allocation function at step i;

  • \(t_{\max }(i) = \max _{j} t_{j,\sigma ^{(i)}(j)}\) is the maximum execution time of any task at step i;

  • \(j^{\star }(i) \) is the index of the task with longest execution time at step i (break ties arbitrarily);

  • \(A_{\mathrm{tot}}(i) =\sum _j t_{j,\sigma ^{(i)}(j)} \sigma ^{(i)}(j)\) is the total work that has to be done at step i;

  • \(\text {COST} _i\) is the result of the scheduling procedure at the end of step i;

  • \(\textsc {opt} \) denotes an optimal solution, with allocation function \(\sigma ^{(\textsc {opt})}\), execution time \(\text {COST} _\textsc {opt} \), and total work

    $$\begin{aligned} A_{\textsc {opt}} = \sum _j t_{j,\sigma ^{(\textsc {opt})}(j)} \sigma ^{(\textsc {opt})}(j). \end{aligned}$$

There are three different ways to exit algorithm pack-Approx:

  • If we cannot add processors to the task with longest execution time, i.e., \(\sigma ^{(i)}(j^{\star }(i))=p\);

  • If \(\frac{A_{\mathrm{tot}}(i)}{p}>t_{\max }(i)\) after having computed the execution time for this assignment;

  • When each task has been assigned p processors (the last step of the loop “for”: we have assigned exactly np processors, and no task can be assigned more than p processors).

Lemma 1

At the end of step i, \(\text {COST} _i \le 3 \max \Big (t_{\max }(i),\frac{A_{\mathrm{tot}}(i)}{p} \Big )\).

Proof

Consider the packs returned by Algorithm 2, sorted by non-increasing execution times, \(B_1,B_2,\ldots , B_n\) (some of the packs may be empty, with an execution time 0). Define, for \(1\le q\le n\),

  • \(j_q\) the task with the longest execution time of pack \(B_q\) (i.e., the first task scheduled on \(B_q\));

  • \(t_q\) the execution time of pack \(B_q\) (in particular, \(t_q =t_{j_q,\sigma ^{(i)}(j_q)}\));

  • \(A_q\) the sum of the task works in pack \(B_q\);

  • \(p_q\) the number of processors available in pack \(B_q\) when \(j_{q+1}\) was scheduled in pack \(B_{q+1}\).

With these notations, \(\text {COST} _i = \sum _{q=1}^n t_q\) and \(A_{\mathrm{tot}}(i) = \sum _{q=1}^n A_q\). For each pack, note that \(p t_q \ge A_q\), since \(p t_q\) is the maximum work that can be done on p processors with an execution time of \(t_q\). Hence, \(\text {COST} _i \ge \frac{A_{\mathrm{tot}}(i)}{p}\).

In order to bound \(\text {COST} _i\), let us first remark that \(\sigma ^{(i)}(j_{q+1}) > p_q\): otherwise \(j_{q+1}\) would have been scheduled on pack \(B_q\). Then, we can exhibit a lower bound for \(A_q\), namely \(A_q \ge t_{q+1} (p-p_q)\). Indeed, the tasks scheduled before \(j_{q+1}\) all have a length greater than \(t_{q+1}\) by definition. Furthermore, obviously \(A_{q+1} \ge t_{q+1} p_q\) (the work of the first task scheduled in pack \(B_{q+1}\)). So finally we have, \(A_q + A_{q+1} \ge t_{q+1} p\).

Summing over all q’s, we have: \(2\sum _{q=1}^n \frac{A_q}{p} \ge \sum _{q=2}^n t_q\), hence \(2\frac{A_{\mathrm{tot}}(i)}{p} +t_1 \ge \text {COST} _i\). Finally, note that \(t_1 = t_{\max }(i)\), and therefore \(\text {COST} _i \le 3 \max \left( t_{\max }(i),\frac{A_{\mathrm{tot}}(i)}{p} \right) \). Note that this proof is similar to the one for the Strip-Packing problem in Coffman et al. (1980). \(\square \)

Lemma 2

At each step i, \(A_{\mathrm{tot}}(i+1) \ge A_{\mathrm{tot}}(i)\) and \(t_{\max }(i+1)\le t_{\max }(i)\), i.e., the total work is increasing and the maximum execution time is decreasing.

Proof

\(A_{\mathrm{tot}}(i+1) = A_{\mathrm{tot}}(i)- a+b\), where

  • \(a=work (j^{\star }(i),\sigma ^{(i)}(j^{\star }(i))) \), and

  • \(b=work (j^{\star }(i),\sigma ^{(i+1)}(j^{\star }(i)))\).

But \(b =work (j^{\star }(i),\sigma ^{(i)}(j^{\star }(i))+1)\) and \(a \le b\) by Eq. (2). Therefore, \(A_{\mathrm{tot}}(i+1) \ge A_{\mathrm{tot}}(i)\). Finally, \(t_{\max }(i+1)\le t_{\max }(i)\) because only one of the longest running tasks is modified, and its execution time can only decrease thanks to Eq. (1). \(\square \)

Lemma 3

Given an optimal solution \(\textsc {opt}\), \(\forall j, t_{j,\sigma ^{(\textsc {opt})}(j)} \le \text {COST} _\textsc {opt} \) and \(A_{\textsc {opt}} \le p\text {COST} _\textsc {opt} \).

Proof

The first inequality is obvious. As for the second one, \(p\text {COST} _\textsc {opt} \) is the maximum work that can be done on p processors within a time of \(\text {COST} _\textsc {opt} \), hence it cannot be smaller than \(A_{\textsc {opt}}\), which is the sum of the works of the tasks under the optimal allocation. \(\square \)

Lemma 4

For any step i such that \(t_{\max }(i) > \text {COST} _{\textsc {opt}}\), then \(\forall j, \sigma ^{(i)}(j) \le \sigma ^{(\textsc {opt})}(j)\), and \(A_{\mathrm{tot}}(i) \le A_{\textsc {opt}}\).

Proof

Consider a task \(T_j\). If \(\sigma ^{(i)}(j) =1\), then clearly \(\sigma ^{(i)}(j) \le \sigma ^{(\textsc {opt})}(j)\). Otherwise, \(\sigma ^{(i)}(j) >1\), and then by definition of the algorithm, there was a step \(i{^\prime }<i\), such that \(\sigma ^{(i{^\prime })}(j) = \sigma ^{(i)}(j) -1\) and \(\sigma ^{(i{^\prime }+1)}(j) = \sigma ^{(i)}(j)\). Therefore \(t_{\max }(i{^\prime }) = t_{j,\sigma ^{(i{^\prime })}(j)}\).

Following Lemma 2, we have \(t_{\max }(i{^\prime }) \ge t_{\max }(i) > \text {COST} _{\textsc {opt}}\). Then necessarily, \(\sigma ^{(\textsc {opt})}(j)>\sigma ^{(i{^\prime })}(j)\), hence the result. Finally, \(A_{\mathrm{tot}}(i) \le A_{\textsc {opt}}\) is a simple corollary of the previous result and of Eq. (2). \(\square \)

Lemma 5

For any step i such that \(t_{\max }(i) > \text {COST} _{\textsc {opt}}\), then \(\frac{A_{\mathrm{tot}}(i)}{p} <t_{\max }(i)\).

Proof

Thanks to Lemma 4, we have \(\frac{A_{\mathrm{tot}}(i)}{p} \le \frac{A_{\textsc {opt}}}{p}\). Lemma 3 gives us \(\frac{A_{\textsc {opt}}}{p} \le \text {COST} _{\textsc {opt}}\), hence the result. \(\square \)

Lemma 6

There exists \(i_0 \ge 0\) such that \(t_{\max }(i_0-1) > \text {COST} _{\textsc {opt}}\ge t_{\max }(i_0)\) (we let \(t_{\max }(-1) = +\infty \)).

Proof

We show this result by contradiction. Suppose such \(i_0\) does not exist. Then \(t_{\max }(0) > \text {COST} _{\textsc {opt}}\) (otherwise \(i_0=0\) would suffice). Let us call \(i_1\) the last step of the run of the algorithm. Then by induction we have the following property, \(t_{\max }(0) \ge t_{\max }(1) \ge \cdots \ge t_{\max }(i_1) > \text {COST} _{\textsc {opt}}\) (otherwise \(i_0\) would exist, hence contradicting our hypothesis). Recall that there are three ways to exit the algorithm, hence three possible definitions for \(i_1\):

  • \(\sigma ^{(i_1)}(j^{\star }(i_1))= p\): then we would have \(t_{\max }(i_1) = t_{j^{\star }(i_1),p}> \text {COST} _{\textsc {opt}} \ge t_{j^{\star }(i_1),\sigma ^{(\textsc {opt})}(j^{\star }(i_1))}\) (according to Lemma 3). This contradicts Eq. (1), which states that \(t_{j^{\star }(i_1),p} \le t_{j^{\star }(i_1),j}\) for all \(j\le p\).

  • \(i_1 = n(p-1) -1\), but then we have the same result, i.e., \(\sigma ^{(i_1)}(j^{\star }(i_1))= p\) because this is true for all tasks.

  • \(t_{\max }(i_1) < \frac{A_{\mathrm{tot}}(i_1)}{p}\), but this is false according to Lemma 5.

We have seen that pack-Approx could not have terminated at step \(i_1\); however, pack-Approx does terminate (in at most \(n(p-1)-1\) steps), hence we have a contradiction. This shows the existence of \(i_0\). \(\square \)

Lemma 7

\(A_{\mathrm{tot}}(i_0) \le A_{\textsc {opt}}\).

Proof

Consider step \(i_0\). If \(i_0=0\), then at this step, all tasks are scheduled on exactly one processor, and \(\forall j, \sigma ^{(i_0)}(j) \le \sigma ^{(\textsc {opt})}(j)\). Therefore, \(A_{\mathrm{tot}}(i_0) \le A_{\textsc {opt}}\). If \(i_0 \ne 0\), consider step \(i_0-1\): \(t_{\max }(i_0-1) > \text {COST} _{\textsc {opt}}\). From Lemma 4, we have \(\forall j, \sigma ^{(i_0-1)}(j) \le \sigma ^{(\textsc {opt})}(j)\). Furthermore, it is easy to see that \(\forall j \ne j^{\star }(i_0-1), \sigma ^{(i_0)}(j)=\sigma ^{(i_0-1)}(j)\) since no task other than \(j^{\star }(i_0-1)\) is modified. We also have the following properties:

  • \(t_{j^{\star }(i_0-1),\sigma ^{(i_0-1)}( j^{\star }(i_0-1))}=t_{\max }(i_0-1)\);

  • \(t_{\max }(i_0-1) > \text {COST} _{\textsc {opt}}\) (by definition of step \(i_0\));

  • \(\text {COST} _{\textsc {opt}} \ge t_{j^{\star }(i_0-1),\sigma ^{(\textsc {opt})}( j^{\star }(i_0-1))}\) (Lemma 3);

  • \(\sigma ^{(i_0)}( j^{\star }(i_0-1)) = \sigma ^{(i_0-1)}( j^{\star }(i_0-1)) +1\).

The first three properties and Eq. (1) show that \(\sigma ^{(i_0-1)}( j^{\star }(i_0-1)) < \sigma ^{(\textsc {opt})}( j^{\star }(i_0-1))\). Thanks to the fourth property, \(\sigma ^{(i_0)}( j^{\star }(i_0-1)) \le \sigma ^{(\textsc {opt})}( j^{\star }(i_0-1))\). Finally, we have \(\sigma ^{(i_0)}(j)\le \sigma ^{(\textsc {opt})}(j)\) for all j, and therefore \(A_{\mathrm{tot}}(i_0) \le A_{\textsc {opt}}\) by Eq. (2). \(\square \)

We are now ready to prove the theorem. For \(i_0\) introduced in Lemma 6, we have

$$\begin{aligned} \displaystyle \begin{array}{c} \text {COST} _{i_0} \le 3 \max \left( t_{\max }(i_0),\frac{A_{\mathrm{tot}}(i_0)}{p}\right) \le 3 \max \left( \text {COST} _{\textsc {opt}},\frac{A_{\textsc {opt}}}{p} \right) \le 3 \text {COST} _{\textsc {opt}} .\\ \end{array} \end{aligned}$$

The first inequality comes from Lemma 1. The second inequality is due to Lemmas 6 and 7. The last inequality comes from Lemma 3, hence the final result. \(\square \)

Minimum resource requirement We conclude this section on theoretical analysis with the following remark. All results can be extended to deal with the variant of the problem where each task \(T_{i}\) has a minimum compute node requirement \(m_i\). Such a requirement is typically provided by the user. In that variant, Eq. (2) is defined only for j greater than \(m_i\). For all previous algorithms, the difference lies in the preliminary step where one assigns one processor to each task: one would now assign \(m_i\) processors to task i, for all i. The total number of steps in the algorithms becomes smaller (because there are fewer processors available). One should note that with this constraint, all results (Theorems 1–5) still hold, and the proofs are quite similar.

5 Heuristics

In this section, we describe the heuristics that we use to solve the k-in- p-CoSchedule problem. As mentioned in Sect. 2, there is no published heuristic to schedule parallel tasks into packs with which we could compare. However, at least for smaller workloads, we will assess the absolute performance of our heuristics by comparing them to the optimal schedule (found as explained in Sect. 4.3).

Random-Pack In this heuristic, we generate the packs randomly: as long as there remain tasks, randomly choose an integer j between 1 and k, and then randomly select j tasks to form a pack. Once the packs are generated, apply Algorithm 1 to optimally schedule each of them.
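A minimal sketch of this random pack generation (shuffling and slicing implements the random selection of j tasks):

```python
import random

def random_pack(n, k):
    """Random-Pack sketch: partition tasks 0..n-1 into packs of at most k
    tasks; Algorithm 1 then schedules each pack optimally."""
    tasks = list(range(n))
    random.shuffle(tasks)                 # random choice of pack members
    packs = []
    while tasks:
        j = random.randint(1, k)          # random pack cardinality
        packs.append(tasks[:j])
        tasks = tasks[j:]
    return packs
```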

Random-Proc In this heuristic, we assign the number of processors to each task randomly between 1 and p, then use Algorithm 2 to generate the packs, followed by Algorithm 1 on each pack.

A word of caution We point out that Random-Pack and Random-Proc are not pure random heuristics, in that they already benefit from the theoretical results of Sect. 4. A more naive heuristic would pick both a task and a number of processors randomly, and greedily build packs, creating a new one as soon as more than p resources are assigned within the current pack. Here, both Random-Pack and Random-Proc use the optimal resource allocation strategy (Algorithm 1) within a pack; in addition, Random-Proc uses an efficient partitioning algorithm (Algorithm 2) to create packs when resources are pre-assigned to tasks.

pack-Approx This heuristic is an extension of Algorithm 3 in Sect. 4.4 to deal with packs of size k rather than p: simply call Make-pack (\(n,p,k,\sigma \)) instead of Make-pack (\(n,p,p,\sigma \)). However, although we keep the same name as in Sect. 4.4 for simplicity, we point out that it is unknown whether this heuristic is a 3-approximation algorithm for an arbitrary k.

pack-by-pack(\(\varepsilon \)) The rationale for this heuristic is to create packs that are well balanced: the difference between the smallest and largest execution times within each pack should be as small as possible. Initially, we assign one processor per task (for \(1\le i \le n\), \(\sigma (i)=1\)), and tasks are sorted into a list L ordered by non-increasing execution times (\(\preccurlyeq _{\sigma }\) values). While there remain tasks in L, let \(T_{i^\star }\) be the first task of the list, and let \(t_{\max } = t_{i^\star ,\sigma (i^\star )}\). Let \(V_\mathrm{{req}}\) be the ordered set of tasks \(T_{i}\) such that \(t_{i,\sigma (i)} \ge (1-\varepsilon )t_{\max }\): this is a sublist of tasks (including \(T_{i^\star }\) as its first element) whose execution times are closest to the longest execution time \(t_{\max }\), where \(\varepsilon \in (0,1)\) is a parameter: a smaller value of \(\varepsilon \) means that we select fewer tasks in \(V_\mathrm{{req}}\). Let \(p_\mathrm{{req}}\) be the total number of processors requested by the tasks in \(V_\mathrm{{req}}\).

If \(p_\mathrm{{req}} \ge p\), a new pack is created greedily with the first tasks of \(V_\mathrm{{req}}\), adding them into the pack while there are no more than p processors used and no more than k tasks in the pack. The corresponding tasks are removed from the list L. Note that \(T_{i^\star }\) is always inserted in the created pack. Also, if we have \(\sigma (i^{\star })=p\), then a new pack with only \(T_{i^{\star }}\) is created. Otherwise (\(p_\mathrm{{req}} < p\)), an additional processor is assigned to the (currently) critical task \(T_{i^\star }\), hence \(\sigma (i^{\star }):= \sigma (i^{\star })+1\), and the process iterates after the list L is updated with the insertion of the new value for \(T_{i^{\star }}\). Finally, once all packs are created, we apply Algorithm 1 in each pack, so as to derive the optimal schedule within each pack.

Fig. 2 Relative costs (a), packing ratios (b), and relative response times (c) of co-schedules for Workload-I on 128 cores. The horizontal line indicates the relative cost, packing ratio, and relative response time, respectively, of an optimal co-schedule for Workload-I

We have \(0<\varepsilon <1\). A small value of \(\varepsilon \) will lead to balanced packs, but may end up with a single task with p processors per pack. Conversely, a large value of \(\varepsilon \) will create new packs more easily, i.e., with fewer processors per task. The idea is therefore to call the heuristic with different values of \(\varepsilon \), and to select the solution that leads to the best execution time.
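A sketch of the whole procedure is given below; Algorithm 1 (the one_pack_schedule sketch) is then applied within each returned pack, and pack-by-pack-9 simply keeps the best co-schedule over several values of eps:

```python
def pack_by_pack(t, p, k, eps):
    """pack-by-pack(eps) sketch: a pack is closed only when the tasks whose
    times are within (1-eps)*t_max of the critical task request at least p
    processors; otherwise the critical task gets one more processor."""
    n = len(t)
    sigma, remaining, packs = [1] * n, set(range(n)), []
    while remaining:
        order = sorted(remaining, key=lambda i: t[i][sigma[i] - 1],
                       reverse=True)
        star = order[0]                      # critical task T_{i*}
        t_max = t[star][sigma[star] - 1]
        v_req = [i for i in order if t[i][sigma[i] - 1] >= (1 - eps) * t_max]
        p_req = sum(sigma[i] for i in v_req)
        if p_req >= p or sigma[star] == p:   # close a pack greedily
            pack, used = [], 0
            for i in v_req:
                if used + sigma[i] <= p and len(pack) < k:
                    pack.append(i)
                    used += sigma[i]
            packs.append(pack)
            remaining -= set(pack)
        else:                                # grow the critical task
            sigma[star] += 1
    return packs, sigma
```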

Summary of heuristics We consider two variants of the random heuristics, either with one single run, or with nine different runs, hence hoping to obtain a better solution, at the price of a slightly longer execution time. These heuristics are denoted, respectively, Random-Pack-1, Random-Pack-9, Random-Proc-1, Random-Proc-9. Similarly, for pack-by-pack, we either use one single run with \(\varepsilon =0.5\) (pack-by-pack-1), or 9 runs with \(\varepsilon \in \{.1, .2, \ldots , .9\}\) (pack-by-pack-9). Of course, there is only one variant of pack-Approx, hence leading to seven heuristics.

Variants We have investigated variants of pack-by-pack, trying to make a better choice than the greedy choice to create the packs, for instance using a dynamic-programming algorithm to minimize processor idle times in the pack. However, there was very little improvement at the price of a much higher running time of the heuristics. Additionally, we tried to improve heuristics with up to 99 runs, both for the random ones and for pack-by-pack, but here again, the gain in performance was negligible compared to the increase in running time. Therefore, we present only results for these seven heuristics in the following.

6 Experimental results

In this section, we study the performance of the seven heuristics on workloads of parallel tasks. First we describe the workloads, whose application execution times model profiles of parallel scientific codes. Then we present the measures used to evaluate the quality of the schedules, and finally we discuss the results.

6.1 Workloads

Workload-I corresponds to ten parallel scientific applications that involve VASP (Kresse and Hafner 1993), ABAQUS (Borgesson 1996), LAMMPS (Plimpton 1995), and PETSc (Balay et al. 2012). The execution times of these applications were observed on a cluster with Intel Nehalem 8-core nodes connected by a QDR Infiniband network with a total of 128 cores. In other words, we have \(p=16\) processors, and each processor is a multicore node.

Workload-II corresponds to six parallel applications derived from CoMD and MiniFE proxy-applications (Heroux et al. 2009). CoMD and MiniFE mimic the behavior of a typical molecular dynamics code and unstructured implicit finite element code, respectively. The execution times of these applications were observed on 16 compute nodes of the Gordon Supercomputer (Gordon 2011). Each compute node of Gordon contains two 8-core Sandy Bridge processors and 64 GB of DDR3-1333 memory. Thus we use a total of 256 cores for evaluating Workload-II.

Workload-III is a synthetic test suite that was designed to represent a larger set of scientific applications. It models tasks whose parallel execution time for a fixed problem size m on q cores is of the form \(t(m,q) = f \times t(m,1) + (1-f) \frac{t(m,1)}{q} + \kappa (m,q)\), where f can be interpreted as the inherently serial fraction, and \(\kappa \) represents overheads related to synchronization and the communication of data. We consider tasks with sequential times t(m, 1) of the form cm, \(cm\log _2 m\), \(cm^2\), and \(cm^3\), where c is a suitable constant. We consider values of f in \(\{0, 0.04, 0.08, 0.16, 0.32\}\), with overheads \(\kappa (m,q)\) of the form \(\log _2 q\), \((\log _2 q)^2\), \(q \log _2 q\), \(\frac{m}{q}\log _2 q\), \(\sqrt{m/q}\), and \(m \log _2 q\) to create a workload with 260 tasks for 256 cores (and \(p=32\) multicore nodes), to study the scalability of our heuristics. For all workloads, we modified speed-up profiles to satisfy Eqs. (1) and (2).
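For illustration, one task profile of Workload-III can be generated as follows (the constants below are hypothetical, chosen only to exercise the model):

```python
import math

def t_parallel(t_seq, f, kappa, q):
    """Workload-III model: t(m,q) = f*t(m,1) + (1-f)*t(m,1)/q + kappa(m,q)."""
    return f * t_seq + (1 - f) * t_seq / q + kappa(q)

# hypothetical example: t(m,1) = c*m^2, f = 0.08, kappa(m,q) = (m/q)*log2(q)
m, c, f = 4096, 1e-6, 0.08
profile = [t_parallel(c * m ** 2, f, lambda q: (m / q) * math.log2(q), q)
           for q in range(1, 257)]
```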

We ensure that there is no compute node sharing between applications within a workload, i.e., no more than a single application can be scheduled on a compute node. We impose this restriction to avoid any potential performance issues related to resource sharing across applications within a compute node.

6.2 Methodology for assessing the heuristics

To evaluate the quality of the schedules generated by our heuristics, we consider three measures: Relative cost, Packing ratio, and Relative response time. Recall that the cost of a pack is the maximum execution time of a task in that pack and the cost of a co-schedule is the sum of the costs over all its packs.

We define the relative cost as the cost of a given co-schedule divided by the cost of a 1-pack schedule, i.e., one with each task running at maximum speed on all p processors. For a given k-in- p-CoSchedule, consider \(\sum _{i=1}^{n}{t_{i,\sigma (i)}\times \sigma (i)}\), i.e., the total work performed in the co-schedule when the i-th task is assigned \(\sigma (i)\) processors. We define the packing ratio as this sum divided by p times the cost of the co-schedule; observe that the packing quality is high when this ratio is close to 1, meaning that there is almost no idle time in the schedule.

An individual user could be concerned about an increase in response time and a corresponding degradation of individual productivity. To assess the impact on response time, we consider the performance with respect to a relative response time measure defined as follows. We consider a 1-pack schedule with the n tasks sorted in non-decreasing order of execution time, i.e., in a “shortest task first” order, to yield a minimal value of the response time. If this ordering is given by the permutation \(\pi (i), i=1,2,\ldots , n\), the response time of task i is \( r_i = \sum _{j=1}^{i}{t_{\pi (j),p}}\) and the mean response time is \(R=\frac{1}{n}\sum _{i=1}^{n}{r_i}\). For a given k-in- p-CoSchedule with u packs scheduled in increasing order of the costs of a pack, the response time of task i in pack v, \(1 \le v \le u\), assigned to \(\sigma (i)\) processors, is: \(\hat{r}_i = \sum _{\ell =1}^{v-1}{cost(\ell )} + t_{i,\sigma (i)}\), where \(cost(\ell )\) is the cost of the \(\ell \)-th pack for \(1\le \ell \le u\). The mean response time of the k-in- p-CoSchedule \(\hat{R}\) is calculated using these values and we use \(\frac{\hat{R}}{R}\) as the relative response time.
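The three measures can be computed directly from a co-schedule, as in the following sketch (packs is the list of packs and sigma the processor assignment):

```python
def measures(t, p, packs, sigma):
    """Relative cost, packing ratio, and relative response time of a
    co-schedule, following the definitions above."""
    n = len(t)
    pack_cost = [max(t[i][sigma[i] - 1] for i in pack) for pack in packs]
    cost = sum(pack_cost)
    baseline_cost = sum(t[i][p - 1] for i in range(n))   # 1-pack schedule
    work = sum(t[i][sigma[i] - 1] * sigma[i] for i in range(n))
    # co-schedule response times: packs in increasing order of their costs
    r_hat, elapsed = [], 0.0
    for v in sorted(range(len(packs)), key=lambda v: pack_cost[v]):
        r_hat += [elapsed + t[i][sigma[i] - 1] for i in packs[v]]
        elapsed += pack_cost[v]
    # baseline: shortest task first, each task on all p processors
    r, acc = [], 0.0
    for x in sorted(t[i][p - 1] for i in range(n)):
        acc += x
        r.append(acc)
    return (cost / baseline_cost, work / (p * cost),
            (sum(r_hat) / n) / (sum(r) / n))
```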

6.3 Results for Workload-I and Workload-II

For Workload-I, we consider packs of size \(k=2,4,6,8,10\) with 16 compute nodes, 8 cores per node (hence a total of 128 cores). Note that we do not try \(k=p=16\) since there are only 10 applications in this workload. For Workload-II, we consider packs of size \(k=2,4,6\) on 16 compute nodes of the Gordon Supercomputer (total of 256 cores).

Fig. 3 Relative costs (a), packing ratios (b), and relative response times (c) of co-schedules for Workload-II on 256 cores

Fig. 4 Relative costs (a), packing ratios (b), and relative response times (c) of co-schedules for Workload-III on 256 cores

Figures 2a and 3a show the relative cost of co-schedules computed by the heuristics. For Workload-I (Fig. 2a), the optimal co-schedule was constructed using exhaustive search. We observe that the optimal co-schedule has costs that are more than 35 % smaller than the cost of a 1-pack schedule for Workload-I. Additionally, we observe that pack-Approx and pack-by-pack compute co-schedules that are very close to the optimal one for all values of the pack size. Both Random-Pack and Random-Proc perform poorly when compared to pack-by-pack and pack-Approx, especially when a single run is performed. As expected, Random-Proc does better than Random-Pack because it benefits from the use of Algorithm 2, and for this small workload, Random-Proc-9 almost always succeeds in finding a near-optimal co-schedule.

Results are similar for the larger Workload-II as shown in Fig. 3a. Computing the optimal co-schedule was not feasible because of the exponential growth in running times for exhaustive search. With respect to the cost of a 1-pack schedule, we observe very significant benefits, with a reduction in costs of more than 40 % for larger values of the pack size. This corresponds to significant savings in energy consumed by the hardware for servicing a specific workload.

Figures 2b and 3b show the quality of packing achieved by the heuristics. The packing ratios are very close to one for pack-by-pack and pack-Approx, indicating that our methods are producing high-quality packings. In many cases, Random-Proc and Random-Pack lead to higher packing ratios compared to pack-by-pack and pack-Approx.

Finally, Figs. 2c and 3c show that our heuristics produce lower cost schedules with commensurate reductions in response times.

6.4 Scalability

Figure 4 shows scalability trends for Workload-III with 260 tasks on 32 8-core processors (hence a total of 256 cores). Although the heuristics, including Random-Pack and Random-Proc, result in reducing costs relative to those for a 1-pack schedule, pack-Approx and pack-by-pack are clearly superior, even when the random schemes are run 9 times. We observe that for pack sizes of 16 and 32, pack-Approx and pack-by-pack produce high-quality co-schedules with costs and response times that are, respectively, 90 and 80 % lower than those for a 1-pack schedule. pack-by-pack-1 obtains results that are very close to those of pack-by-pack-9, hence even a single run returns a high-quality co-schedule.

6.5 Running times of heuristics

All heuristics run within a few milliseconds, even for the largest workload; this is negligible compared to the time required to execute the workload itself, which can range from a few seconds to several minutes.

6.6 Summary of experimental results

Results indicate that heuristics pack-Approx and pack-by-pack both produce co-schedules of comparable quality. pack-by-pack-9 is slightly better than pack-by-pack-1, at the price of an increase in running time from using more values of \(\varepsilon \). However, the running time remains very small, and similar to that of pack-Approx. Using more values of \(\varepsilon \) to improve pack-by-pack leads to small gains in performance. However, these small gains in performance correspond to significant gains in system throughput and energy, and far outweigh the costs of computing multiple co-schedules. This makes pack-by-pack-9 the heuristic of choice. Our experiments with 99 values of \(\varepsilon \) did not improve performance, indicating that large increases in the number of \(\varepsilon \) values may not be necessary.

7 Conclusion

We have developed and analyzed co-scheduling algorithms for processing a workload of parallel tasks. Tasks are assigned to processors and are partitioned into packs of size k with the constraint that the total number of processors assigned over all tasks in a pack does not exceed p, the maximum number of available processors. Tasks in each pack execute concurrently on a number of processors, and the workload completes in a time equal to the sum of the execution times of the packs. We have provided complexity results for minimizing the sum of the execution times of the packs. The bad news is that this optimization problem is NP-complete. This does not come as a surprise because we have to choose for each task both a number of processors and a pack, and this double freedom induces a huge combinatorial solution space. The good news is that we have provided an optimal resource allocation strategy once the packs are formed (Theorem 3), together with an efficient load-balancing algorithm to partition tasks with pre-assigned resources into packs. This load-balancing algorithm is proven to be a 3-approximation algorithm for the most general instance of the problem. Building upon these positive results, we have developed several heuristics that exhibit very good performance in our test sets. These heuristics can significantly reduce the time for completion of a workload for corresponding savings in system energy costs. Additionally, these savings come along with measurable benefits in the average response time for task completion, thus making it attractive from the user’s viewpoint.

These co-schedules can be computed very rapidly when speed-up profile data are available. Additionally, they operate at the scale of workloads with a few to several hundred applications to deliver significant gains in energy and time per workload. These properties present opportunities for future work: developing hybrid approaches could additionally leverage dynamic voltage and frequency scaling (DVFS) within an application. For example, Rountree et al. (2009) have shown that depending on the properties of the application, DVFS can be applied at runtime through their Adagio system, to yield system energy savings of 5–20 %. A potential hybrid scheme could start with the computation of a k-in- p-CoSchedule for a workload, following which DVFS could be applied at runtime per application.

Our work indicates the potential benefits of co-schedules for high-performance computing installations where even medium-scale facilities consume Megawatts of power. We plan to further test and extend this approach towards deployment in university scale computing facilities where workload attributes often do not vary much over weeks to months and energy costs can be a limiting factor.