
1 Introduction

In data-parallel programming models, parallelism is not expressed as a set of process interactions but as a sequence of parallel operations on data sets. Programs are typically composed from high-level data-parallel operations and are declarative rather than imperative in nature, which is of particular interest when it comes to programming the ever more present multicore systems. Solutions to many computational problems contain elements which can be expressed in terms of data-parallel operations [12].

We show several examples of data-parallel programs in Fig. 1. These programs rely heavily on higher-order data-parallel operations such as map, reduce and filter, which take a function argument – they are parametrized by a mapping function, a reduction operator or a filtering predicate, respectively. The first example in Fig. 1 computes the variance of a set of measurements ms. It starts by computing the mean value using the higher-order operation sum, and then maps each element of ms into its squared distance from the mean; the sum of these distances divided by the number of elements is the variance v. The amount of work executed for each measurement value is equal, so we call this workload uniform. This need not always be so. The second program computes all the prime numbers from \(3\) until \(N\) by calling a data-parallel filter on the corresponding range. The filter uses a predicate that checks that no number from \(2\) to \(\sqrt{i}\) divides \(i\). The workload is neither uniform nor independent of \(i\), and the processors working on the end of the range need to do more work. This example also demonstrates that data-parallelism can be nested – the forall can itself be done in parallel, as each element may require a lot of work. On the other hand, the reduce in the third program, which computes a sum of the numbers from \(0\) to \(N\), requires a minimum amount of work per element. A good data-parallel scheduler must be efficient for all these workloads – when executed with a single processor, the reduce in the third program must have the same running time as the while loop in the fourth program, whose data-parallelism is not immediately obvious due to its imperative style.

Fig. 1. Data-parallel program examples
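
Fig. 1 itself is not reproduced here; the following Scala sketch illustrates the four kinds of programs described above, using parallel ranges and collections (`.par`, available directly in Scala 2.12 and earlier, or via the scala-parallel-collections module in 2.13+). The values of `ms` and `N` are placeholders, not the sizes used in the evaluation.

```scala
import scala.math.sqrt

object DataParallelExamples {
  def main(args: Array[String]): Unit = {
    val N  = 1000
    val ms = Vector.fill(100)(scala.util.Random.nextDouble())

    // (1) variance of a set of measurements -- a uniform workload
    val mean = ms.par.sum / ms.size
    val v    = ms.par.map(x => (x - mean) * (x - mean)).sum / ms.size

    // (2) primes between 3 and N -- an irregular workload, larger near the end of the range
    val primes = (3 to N).par.filter(i => (2 to sqrt(i).toInt).forall(d => i % d != 0))

    // (3) data-parallel sum of the numbers 0 until N -- minimal work per element
    val sumPar = (0 until N).par.reduce(_ + _)

    // (4) the sequential while loop -- the baseline the scheduler must not slow down
    var sum = 0; var i = 0
    while (i < N) { sum += i; i += 1 }

    println((v, primes.size, sumPar, sum))
  }
}
```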

It has been a trend in many languages to provide data-parallel bulk operations on collections [35, 17, 18]. Data-parallel operations are generic, as shown in Fig. 1 – for example, reduce takes a user-provided operator, such as number addition, string concatenation or matrix multiplication. The computational costs of these generic parts, and hence the workload distribution, cannot always be determined statically, so efficient assignment of work to processors often relies on runtime scheduling. Scheduling in this case entails dividing the elements into batches on which the processors work in isolation. Work-stealing [1, 7, 8, 15, 20] is one solution to this problem. In this technique, different processors occasionally steal batches from each other to balance the load – the goal is that no processor stays idle for too long.

In this paper we propose and describe a runtime scheduler for data-parallel operations on shared-memory architectures that uses a variant of work-stealing to ensure proper load balancing. The scheduler relies on a novel data structure with lock-free synchronization operations called the work-stealing tree. To show that the work-stealing tree scheduler is optimal, we focus on evaluating its performance on uniform workloads with a minimum amount of computation per element, on irregular workloads for which this amount varies, and on workloads with a very coarse granularity.

Our algorithm is based on the following assumptions. There are no fast, accurate means to measure elapsed time with sub-microsecond precision, i.e. the scheduler cannot rely on measuring the running time of small amounts of work. There is no static or runtime information about the cost of an operation – when invoking a data-parallel operation, we do not know how much computation each element requires. There are no hardware-level interrupt handling mechanisms at our disposal – the only way to interrupt a computation is to have the processor check a condition. We assume OS threads as parallelism primitives, with no control over the OS scheduler. We assume that the available synchronization primitives are monitors and the CAS instruction, and that automatic memory management is present.

The rest of the paper is organized as follows. Section 2 describes related work and alternative schedulers we compare against. Section 3 describes the work-stealing tree scheduler. In Sect. 4 we evaluate the scheduler for different workloads as well as tune several of its parameters, and in Sect. 5 we conclude.

2 Related Work

Per-processor (henceforth, worker) work assignment done statically at compile time or link time, which we will refer to as static batching, has been studied extensively [13, 19]. Static batching cannot correctly predict the workload distribution of every problem, as shown by the second program in Fig. 1. Without knowing the exact numbers in the set, batches cannot be statically assigned to workers in an optimal way – some workers may end up with more work than others. Still, although cost analysis is not the focus here, we advocate combining static analysis with runtime techniques.

To address the need for load balancing at runtime, work can be divided into many small batches. Once a worker processes its batch, it requests a new batch from a centralized queue. We will refer to this as fixed-size batching [14]. In fixed-size batching, the workload itself dictates how work is assigned to workers. This is a major difference with respect to static batching. In general, in the absence of information about the workload distribution, scheduling should be workload-driven. A natural question arises – what is the ideal size for a batch? Ideally, a batch should consist of a single element, but the cost of requesting work from a centralized queue is prohibitively large for that. For example, replacing the increment i += 1 with an atomic CAS can increase the running time of a while loop by nearly an order of magnitude on modern architectures. The batch size therefore has to be the smallest number of elements for which the cost of accessing the queue is amortized by the actual work. There are two issues with this technique. First, it is not scalable – as the number of workers increases, so does contention on the work queue (Fig. 6). This requires increasing batch sizes further. Second, as the work granularity approaches the batch size, the work division is no longer fine-grained and the speedup is suboptimal (Fig. 8, where size is less than \(1024\)).
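
A minimal sketch of fixed-size batching as described above, with a single atomic counter playing the role of the centralized queue over a loop range; the batch size, worker count and kernel are placeholders, not values from the paper.

```scala
import java.util.concurrent.atomic.AtomicInteger

object FixedSizeBatching {
  // Workers repeatedly reserve the next `batch` indices of [0, until) from a shared counter --
  // every reservation is a synchronization on the same memory location.
  def parallelFor(until: Int, batch: Int, workers: Int)(kernel: Int => Unit): Unit = {
    val progress = new AtomicInteger(0)
    val threads = (0 until workers).map { _ =>
      new Thread {
        override def run(): Unit = {
          var start = progress.getAndAdd(batch)
          while (start < until) {
            var i = start
            val end = math.min(start + batch, until)
            while (i < end) { kernel(i); i += 1 }
            start = progress.getAndAdd(batch)
          }
        }
      }
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }

  def main(args: Array[String]): Unit = {
    // Trivial kernel -- with real workloads the batch size must amortize the counter access.
    parallelFor(until = 10000000, batch = 128, workers = 4)(i => ())
    println("done")
  }
}
```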

Guided self-scheduling [16] solves some granularity issues by dynamically choosing the batch size based on the number of remaining elements. At any point, the batch size is \(R_i / P\), where \(R_i\) is the number of remaining elements and \(P\) is the number of workers – the granularity becomes finer as less and less work remains. Note that the first-arriving worker is assigned the largest batch of work. If this batch contains more work than the rest of the loop due to irregularity, the speedup will not be linear. This is shown in Figs. 8-20 and 9-35. Factoring [10] and trapezoidal self-scheduling [21] improve on guided self-scheduling, but suffer from the same issue on those workload distributions.
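
To make the rule concrete, the following sketch only prints the sequence of batch sizes \(R_i / P\) that guided self-scheduling would hand out for a hypothetical range of 1000 elements and 4 workers; the numbers are illustrative, not measurements.

```scala
object GuidedSelfSchedulingBatches {
  def main(args: Array[String]): Unit = {
    val P = 4                // number of workers
    var remaining = 1000     // R_i: elements not yet assigned
    val batches = scala.collection.mutable.ListBuffer[Int]()
    while (remaining > 0) {
      val b = math.max(1, remaining / P)   // next batch size is R_i / P, at least one element
      batches += b
      remaining -= b
    }
    // 250, 187, 140, 105, ... -- the first-arriving worker receives the largest batch
    println(batches.take(10).mkString(", "))
  }
}
```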

One way to overcome the contention issues inherent in the techniques above is to use several work queues rather than a centralized one. In this approach each processor starts with some initial work on its queue and steals from other queues when it runs out of work – this is known as work-stealing, a technique applicable to both task- and data-parallelism. One of the first uses of work-stealing dates back to the Cilk language [2, 8], in which processors relied on fast and slow versions of the code to steal stack frames from each other. Recent developments in the X10 language are based on similar techniques [20]. Work-stealing typically relies on work-stealing queues [1, 7, 8, 15] and deques [6], with implementations ranging from blocking to lock-free. While data-parallel collection frameworks have in the past relied on task-parallel schedulers under the hood [11, 17, 18], to the best of our knowledge a tree data structure had not been used for synchronization in work-stealing prior to this work, nor for scheduling data-parallel operations.

3 Work-Stealing Tree Scheduler

In this section we describe the work-stealing tree data structure and the scheduling algorithm that the workers run. We first briefly discuss the aforementioned fixed-size batching. We have mentioned that contention on the centralized queue is one of its drawbacks. We could replace the centralized queue with a queue for each worker and use work-stealing. However, this seems overly eager – we do not want to create as many work queues as there are workers for each parallel operation, as the cost of doing so may outweigh the actual useful work. We should start with a single queue and create additional ones on demand. Furthermore, fixed-size batching seems appropriate for scheduling parallel loops, but what about the reduce operation? If each worker stores its intermediate results separately, then reduce may not be applicable to non-commutative operators (e.g. string concatenation). It seems reasonable to have the work-stealing data structure store the intermediate results, since it holds the information about the division order.

With this in mind, we note that a tree is particularly applicable. When created, it consists of a single node – a root representing the operation and all the elements of the range. The worker invoking the parallel operation can work on the elements and update its progress by writing to the node it owns. If it completes before any other worker requests work, then the overhead of the operation is merely the creation of the root. Conversely, if another worker arrives, it can steal some of the work by creating two child nodes, splitting the elements between them and continuing work on one of them. This proceeds recursively. Scheduling is thus workload-driven – nodes are created only when some worker runs out of work, meaning that another worker had too much work. Such a tree can also store intermediate results in its nodes, serving as a reduction tree.

How can such a tree be used for synchronization and load-balancing? We assumed that the parallelism primitives are OS threads. We can keep a pool of threads [15] that are notified when a parallel operation is invoked – we call these workers. We first describe the worker algorithm from a high-level perspective. Each worker starts by calling the tail-recursive run method in Fig. 2. By calling findWork in line 3, it either finds a node in the tree that is not yet owned, or steals a node that some other worker is working on. This node is initially a leaf, but we call it a subtree. The worker works on the subtree by calling descend in line 5, which calls workOn on the root of the subtree to process it until it is either completed or stolen. In the case of a steal, the worker continues working on one of the children if it can own it in line 11. This is repeated until findWork returns \(\bot \) (null), indicating that all the work is completed.

Fig. 2. Work-stealing tree data-types and the scheduling algorithm

In Fig. 2 we also present the work-stealing tree and its basic data-types. We use the keyword struct to refer to a compound data-type – this can be a Java class or a C structure. We define two compound data-types. Ptr is a reference to the tree – it has only a single member child of type Node. Write access to child has to be atomic and globally visible (in Java, this is ensured with the volatile keyword). Node contains immutable references to the left and right subtree, initialized upon instantiation. If these are set to \(\bot \) we consider the node a leaf. We initially focus on parallelizing loops over ranges, so we encode the current state of iteration with three integers. Members start and until are immutable and denote the initial range – for the root of the tree this is the entire loop range. Member progress has atomic, globally visible write access. It is initially set to start and is updated as elements are processed. Finally, the owner field denotes the worker that is working on the node. It is initially \(\bot \) and also has atomic write access. Example trees are shown in Fig. 3.
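
Since Fig. 2 is not reproduced above, the following Scala sketch shows the two data-types as described in the prose. AtomicInteger and AtomicReference wrappers stand in for the volatile fields with CAS-based write access mentioned in the text; the field names follow the description.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}

// Node of the work-stealing tree over a loop range [start, until).
class Node(
  val left: Ptr,     // immutable reference to the left subtree (null for a leaf)
  val right: Ptr,    // immutable reference to the right subtree (null for a leaf)
  val start: Int,    // first element of the range, immutable
  val until: Int     // end of the range (exclusive), immutable
) {
  // progress: initially start, advanced as elements are processed;
  // owner: the worker working on this node, initially ⊥ (null).
  val progress = new AtomicInteger(start)
  val owner    = new AtomicReference[AnyRef](null)
  def isLeaf: Boolean = left == null
}

// Reference into the tree -- its only member is the child node it points to.
class Ptr(initial: Node) {
  val child = new AtomicReference[Node](initial)
}
```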

Before we describe the operations and the motivation behind these data-types, we define the states the work-stealing tree can be in (see Fig. 3), namely its invariants. This is of particular importance for concurrent data structures with non-blocking operations. Work-stealing tree operations are lock-free, a well-known advantage [9], which in this case comes at the cost of only a little extra complexity.

INV1. Whenever a new node reference Ptr p becomes reachable in the tree, it initially points to a leaf Node n, such that n.owner = \(\bot \). Field n.progress is set to n.start and \(\mathtt{n.until }{\ge }\mathtt{n.start }\). The subtree is in the AVAILABLE state and its range is \(\langle \mathtt{n.start }{,}\mathtt{n.until }\rangle \).

INV2. The set of transitions of n.owner is \(\bot \rightarrow \pi \ne \bot \). No other field of n can be written until n.owner \(\ne \bot \). After this happens, the subtree is in the OWNED state.

INV3. The set of transitions of n.progress in the OWNED state is \(p_0 \rightarrow p_1 \rightarrow \ldots \rightarrow p_k\) such that \(\mathtt{n.start }= p_0 < p_1 < \ldots < p_k < \mathtt{n.until }\). If a worker \(\pi \) writes a value from this set of transitions to n.progress, then \(\mathtt{n.owner } = \pi \).

INV4. If the worker n.owner writes the value n.until to n.progress, then that is the last transition of n.progress. The subtree goes into the COMPLETED state.

INV5. If a worker \(\psi \) overwrites \(p_i\), such that \(\mathtt{n.start } \le p_i < \mathtt{n.until }\), with \(p_s = -p_i - 1\), then \(\psi \ne \mathtt{n.owner }\). This is the last transition of n.progress and the subtree goes into the STOLEN state.

INV6. The field p.child can be overwritten only in the STOLEN state, in which case its transition is \(\mathtt{n } \rightarrow \mathtt{m }\), where m is a copy of n with m.left and m.right being fresh leaves in the AVAILABLE state with ranges \(r_l = \langle x_0, x_1 \rangle \) and \(r_r = \langle x_1, x_2 \rangle \) such that \(r_l \cup r_r = \langle p_i, \mathtt{n.until }\rangle \). The subtree goes into the EXPANDED state.

This seemingly complicated set of invariants can be summarized in a straightforward way. Upon owning a leaf, a worker processes elements from that leaf’s range by incrementing the progress field until either it processes all the elements or another worker requests some work by invalidating progress, in which case the leaf is replaced by a subtree in which the remaining work is divided between two new leaves.
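
The five states can be read off the fields directly. The following sketch classifies a subtree by the observable field values, following INV1–INV6 and using the Node/Ptr sketch from the previous listing; it is only an illustration of the state diagram in Fig. 3.

```scala
object SubtreeState {
  sealed trait State
  case object Available extends State   // INV1: fresh leaf, owner = ⊥
  case object Owned     extends State   // INV2, INV3: owner set, progress advancing
  case object Completed extends State   // INV4: owner wrote until into progress
  case object Stolen    extends State   // INV5: progress overwritten with -p - 1
  case object Expanded  extends State   // INV6: child replaced by an inner node

  def of(ptr: Ptr): State = {
    val n = ptr.child.get
    if (!n.isLeaf) Expanded
    else if (n.owner.get == null) Available
    else {
      val p = n.progress.get
      if (p < 0) Stolen
      else if (p == n.until) Completed
      else Owned
    }
  }
}
```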

Fig. 3. Work-stealing subtree state diagram

Fig. 4. Basic work-stealing tree operations

Now that we have formally defined a valid work-stealing tree, we provide an implementation of the basic operations (Fig. 4). These operations will be the building blocks for the scheduling algorithm that balances the workload. A worker must attempt to acquire ownership of a node before processing its elements by calling the method tryOwn, which returns true if the claim is successful. After reading the owner field in line 14 and establishing the AVAILABLE state, the worker attempts to atomically push the node into the OWNED state with the CAS in line 15. This CAS can fail either due to a faster worker claiming ownership or spuriously – a retry follows in both cases.

A worker that claimed ownership of a node repeatedly calls tryAdvance, which attempts to reserve a batch of size STEP by atomically incrementing the progress field, eventually bringing the node into the COMPLETED state. If tryAdvance returns a nonnegative number, the owner is obliged to process that many elements, whereas a negative number is an indication that the node was stolen.

A worker searching for work must call trySteal if it finds a node in the OWNED state. This method returns true if the node was successfully brought into the EXPANDED state by any worker, or false if the node ends up in the COMPLETED state. Method trySteal consists of two steps. First, it attempts to push the node into the STOLEN state with the CAS in line 35, after determining that the node read in line 29 is a leaf. This CAS can fail either due to a different steal, a successful tryAdvance call or spuriously. A successful CAS in line 35 brings the node into the STOLEN state. Regardless of success or failure, trySteal is then called recursively. In the second step, the expanded version of the node from Fig. 3 is created by the newExpanded method, the pseudocode of which is not shown here since it consists of isolated single-threaded code. The child field in Ptr is replaced with the expanded version atomically with the CAS in line 39, bringing the node into the EXPANDED state.
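
A condensed Scala sketch of the three operations, reconstructed from the prose and the invariants (Fig. 4 itself is not reproduced). It reuses the Node/Ptr sketch from Sect. 3, inlines newExpanded, and uses an illustrative STEP value; the line numbers cited in the text refer to the original figure, not to this sketch.

```scala
object WorkStealingOps {
  val STEP = 64   // batch size; a placeholder value, tuned in Sect. 4

  // AVAILABLE -> OWNED: returns true if the calling worker now owns the node.
  def tryOwn(n: Node, worker: AnyRef): Boolean =
    n.owner.get == null && n.owner.compareAndSet(null, worker)

  // Reserve a batch of up to `step` elements by advancing progress.
  // Nonnegative result: the owner must process that many elements; negative: the node was stolen.
  def tryAdvance(n: Node, step: Int = STEP): Int = {
    val p = n.progress.get
    if (p < 0) -1                                   // progress was invalidated -- STOLEN
    else if (p >= n.until) 0                        // COMPLETED
    else {
      val np = math.min(p + step, n.until)
      if (n.progress.compareAndSet(p, np)) np - p   // reserved the batch [p, np)
      else -1                                       // only a thief writes concurrently -- STOLEN
    }
  }

  // OWNED -> STOLEN -> EXPANDED: returns true once the subtree is EXPANDED,
  // false if it ends up COMPLETED instead.
  def trySteal(ptr: Ptr): Boolean = {
    val n = ptr.child.get
    if (!n.isLeaf) true                             // some worker already expanded it
    else {
      val p = n.progress.get
      if (p >= n.until) false                       // COMPLETED -- nothing left to steal
      else if (p >= 0) {                            // try to invalidate progress
        n.progress.compareAndSet(p, -p - 1)
        trySteal(ptr)                               // retry regardless of the CAS outcome
      } else {                                      // STOLEN -- expand into two fresh leaves
        val rem = -p - 1                            // remaining range is [rem, n.until)
        val mid = rem + (n.until - rem) / 2
        val inner = new Node(
          new Ptr(new Node(null, null, rem, mid)),
          new Ptr(new Node(null, null, mid, n.until)),
          n.start, n.until)
        inner.progress.set(p)                       // the inner node is a copy of n
        inner.owner.set(n.owner.get)
        ptr.child.compareAndSet(n, inner)
        trySteal(ptr)                               // loop until EXPANDED is observed
      }
    }
  }
}
```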

We now describe the scheduling algorithm that the workers execute by invoking the run method. There are two basic modes of operation a worker alternates between. First, it calls findWork, which returns a node in the AVAILABLE state (line 3). Then, it calls descend to work on that node until it is stolen or completed, which in turn calls workOn to process the elements. If workOn returns false, then the node was stolen, and the worker tries to descend into one of the subtrees rather than searching the entire tree for work. This decreases the total number of findWork invocations. The method workOn checks that the node is in the OWNED state (line 47), and then attempts to atomically increase progress by calling tryAdvance. The worker is obliged to process the elements after a successful advance, and does so by calling the kernel method, which is nothing more than a while loop like the one in Fig. 1. Generally, kernel can be any kind of workload. Finally, method findWork traverses the tree left to right and, whenever it finds a leaf node, tries to claim ownership. Otherwise, it attempts to steal it until it finds that it is either COMPLETED or EXPANDED, returning \(\bot \) or descending deeper, respectively. Nodes with \(1\) or fewer elements left are skipped.

We explore alternative findWork implementations in Sect. 4. For now, we state but do not prove the following claim. If the method findWork returns \(\bot \), then all the work in the tree was obtained by different workers that had called tryAdvance, except for \(M < P\) loop elements distributed across \(M\) leaf nodes, where \(P\) is the number of workers. This follows from the fact that the tree grows monotonically.

Fig. 5. Scheduling algorithm
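
Fig. 5 is likewise not reproduced, so the sketch below reconstructs the run/descend/workOn/findWork loop from the description above, building on the earlier Node/Ptr and WorkStealingOps sketches. The left-to-right traversal corresponds to the Predefined strategy examined in Sect. 4, and `kernel` stands for the user workload (e.g. a loop body from Fig. 1); the choice of child after a steal is simplified here.

```scala
class Worker(root: Ptr, kernel: (Int, Int) => Unit) extends Thread {
  import WorkStealingOps._

  override def run(): Unit = {
    var subtree = findWork(root)
    while (subtree != null) { descend(subtree); subtree = findWork(root) }
  }

  // Work on a subtree until it is COMPLETED or STOLEN; after a steal, try to continue
  // directly on one of the two fresh children instead of rescanning the whole tree.
  private def descend(ptr: Ptr): Unit =
    if (!workOn(ptr)) {
      trySteal(ptr)                                  // help complete the expansion
      val inner = ptr.child.get                      // now EXPANDED
      if (tryOwn(inner.right.child.get, this)) descend(inner.right)
    }

  // Repeatedly reserve a batch and process it with the kernel; false means stolen.
  private def workOn(ptr: Ptr): Boolean = {
    val node = ptr.child.get
    var from = node.progress.get
    var batch = tryAdvance(node)
    while (batch > 0) {
      kernel(from, from + batch)                     // process the reserved elements
      from = node.progress.get
      batch = tryAdvance(node)
    }
    batch >= 0
  }

  // Left-to-right traversal: own an AVAILABLE leaf, steal an OWNED one, descend otherwise.
  private def findWork(ptr: Ptr): Ptr = {
    val node = ptr.child.get
    if (!node.isLeaf) {
      val fromLeft = findWork(node.left)
      if (fromLeft != null) fromLeft else findWork(node.right)
    } else {
      val p = node.progress.get
      val remaining = node.until - (if (p < 0) -p - 1 else p)
      if (remaining <= 1) null                       // nodes with 1 or fewer elements are skipped
      else if (node.owner.get == null && tryOwn(node, this)) ptr
      else if (trySteal(ptr)) findWork(ptr)          // EXPANDED -- descend into the new leaves
      else null                                      // COMPLETED
    }
  }
}
```

A parallel loop would then create a root Ptr covering the whole range, start P such workers and join them; for a reduce, each node would additionally hold an intermediate result, as noted earlier.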

Note that workOn is similar to fixed-size batching – the only difference is that here the arrival of another worker invalidates the node, whereas in fixed-size batching multiple workers simultaneously call tryAdvance, synchronizing repeatedly. The next section starts by evaluating the impact this has on performance.

4 Evaluation

As hinted in the introduction, we want to evaluate how good our scheduler is for uniform workloads with a low amount of work per element. The reasons for this are twofold – first, we want to compare speedups against an optimal sequential program. Second, such problems appear in practical applications. We thus ensure that the third and fourth program from Fig. 1 really have the same performance for a single processor. We will call the while loop from Fig. 1 the sequential baseline.

Parallelizing the baseline seems trivial. Assuming the workers start at roughly the same time and proceed at roughly the same speed, we can divide the range into equal parts between them. However, an assumption from the introduction was that the workload distribution is not known, and the goal is to parallelize irregular workloads as well. In fact, the workload may have a coarse granularity, consisting of only a few elements.

For the reasons above, we verify that the scheduler abides by the following criteria:

C1 There is no noticeable overhead when executing the baseline with a single worker.

C2 Speedup is optimal for both the baseline and typical irregular workloads.

C3 Speedup is optimal when the work granularity equals the parallelism level.

The workloads we choose correspond to those found in practice. Uniform workloads are particularly common and correspond to numeric computations, text manipulation, Monte Carlo methods and applications that involve basic linear algebra operations like vector addition or matrix multiplication. In Fig. 8 we denote this workload as UNIFORM. Triangular workloads are present in primality testing, multiplication with triangular matrices and computing an adjoint convolution (TRIANGLE). In higher dimensions, computing a convolution consists of several nested loops and can have a polynomial workload distribution (PARABOLA). Depending on how the problem is formulated, the workload may be increasing or decreasing (INVTRIANGLE, HILL, VALLEY). In combinatorial problems such as word segmentation, bin packing or computing anagrams, the problem subdivision can be such that the subproblems corresponding to different elements differ exponentially in size – we model this with an exponentially increasing workload EXP. In raytracing, PageRank or sparse matrix multiplication the workload corresponds to some probability distribution, modelled with the workloads GAUSSIAN and RANDIF. Finally, in problems like Mandelbrot set computation or Barnes-Hut simulation, large conglomerations of elements require a lot of computation while the rest require almost no work. We call this workload distribution STEP.
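
The exact kernel functions used in the benchmarks are not listed in the text; the following Scala sketch gives illustrative stand-ins for a few of the named distributions, where a busy loop emulates a given number of work units per element.

```scala
object Workloads {
  // A busy loop of `units` iterations emulates `units` units of work for one element.
  def work(units: Int): Int = { var s = 0; var k = 0; while (k < units) { s += k; k += 1 }; s }

  // Illustrative per-element cost functions over a range of n elements (i in [0, n)).
  def uniform (i: Int, n: Int) = work(10)                                                // UNIFORM
  def triangle(i: Int, n: Int) = work((i.toLong * 100 / n).toInt)                        // TRIANGLE
  def parabola(i: Int, n: Int) = { val x = (i.toLong * 100 / n).toInt; work(x * x) }     // PARABOLA
  def hill    (i: Int, n: Int) = work(100 - math.abs((i.toLong * 200 / n).toInt - 100))  // HILL
  def exp     (i: Int, n: Int) = work(1 << (i.toLong * 20 / n).toInt)                    // EXP
  def randif  (i: Int, n: Int) = work(if (scala.util.Random.nextInt(100) == 0) 1000 else 1) // RANDIF
  def step    (i: Int, n: Int) = work(if (i >= n / 2 && i < 3 * n / 4) 1000 else 1)      // STEP
}
```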

All the tests were performed on an Intel i7 3.4 GHz quad-core processor with hyperthreading and Oracle JDK 1.7, using the server VM. Our implementation is written in the Scala programming language, which uses the JVM as its backend. JVM programs are commonly regarded as less efficient than programs written in C. To show that the evaluation is comparable to a C implementation, we must evaluate the performance of the corresponding sequential C programs. The running time of the while loop from Fig. 1 is roughly \(45\) ms for \(150\) million elements both in C (GNU C++ 4.2) and on the JVM – if we obtain linear speedups, we can conclude that the scheduler is indeed optimal. We can thus turn our attention to criterion C1.

We stated already that the STEP value should ideally be \(1\) for load-balancing purposes, but that it has to be more coarse-grained due to communication costs that could overwhelm the baseline. In Fig. 6A we plot the running time against the STEP size, obtained by executing the baseline loop with a single worker. By finding the minimum STEP value with no observable overhead, we seek to satisfy criterion C1. The minimum STEP with no noticeable synchronization costs is around \(50\) elements – decreasing STEP to \(16\) doubles the execution time, and for the value \(1\) the execution time is \(36\) times larger (not shown for readability).

Fig. 6. Baseline running time (ms) vs. STEP size

Having shown that the work-stealing tree is as good as fixed-size batching, we evaluate its effectiveness with multiple workers. Figure 6B shows that the minimum STEP for fixed-size batching increases with \(2\) workers, as we postulated earlier. Increasing STEP decreases the frequency of synchronization and with it the communication costs. In this case the \(3\)x slowdown is caused by processors having to exchange ownership of the cache line holding the progress field. The work-stealing tree does not suffer from this problem, since it strives to keep processors isolated – the speedup is linear with \(2\) workers. However, with \(4\) processors the performance of the naive work-stealing tree implementation degrades (Fig. 6C). While the reason is not immediately apparent, note that for greater STEP values the speedup is once again linear. Inspecting the number of elements processed in each node reveals that the uniform workload is not evenly distributed among the topmost nodes – communication costs in those nodes are higher due to false sharing. Even though two processors work on different nodes, they modify the same cache line, slowing down the CAS in line 20. Why this happens in the implementation that follows directly from the pseudocode is beyond the scope of this paper; it suffices to say that padding the node object with dummy fields to adjust its size to the cache line solves the problem, as shown in Fig. 6D, E.
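
The padding might look as follows, extending the Node sketch from Sect. 3; the number and type of dummy fields are illustrative, since the exact layout depends on the VM's object header and field packing, and must be chosen so that two nodes never share a 64-byte cache line.

```scala
// Padded variant of the Node sketch: the dummy longs are never read or written,
// they only inflate the object past the size of a cache line.
class PaddedNode(left: Ptr, right: Ptr, start: Int, until: Int)
    extends Node(left, right, start, until) {
  var pad0, pad1, pad2, pad3, pad4, pad5, pad6, pad7 = 0L
}
```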

The speedup is still not completely linear as the number of workers grows. Our baseline does not access main memory and only touches cache lines in exclusive mode, so this may be due to worker wakeup delays or to scheduling costs in the work-stealing tree. After checking that increasing the total amount of work does not change performance, we focus on the latter. Inspecting the number of tree nodes created at different parallelism levels (Fig. 7B) reveals that as the number of workers grows, the number of nodes grows at a superlinear rate. Each node incurs a synchronization cost, so could we decrease their total number?

Examining a particular work-stealing tree instance at the end of the operation reveals that different workers battle for work in the left subtree until all of its elements are depleted, whereas the right subtree remains unowned during this time. As a result, the workers in any subtree steal from each other more often, creating more nodes. The cause is the left-to-right tree traversal in findWork as defined in Fig. 5, a particularly bad stealing strategy we will call Predefined. As shown in Fig. 7B, the average tree size for \(8\) workers nears \(2500\) nodes. So, let us change the preference of a worker by changing the tree-traversal order in line 70 based on the worker index \(i\) and the level \(l\) in the tree. The worker goes left-to-right if and only if \((i \gg (l \;\mathrm{mod}\;\lceil \log _2P \rceil )) \;\mathrm{mod}\;2 = 1\), where \(P\) is the total number of workers. This way, the first path from the root to a leaf up to depth \(\log _2 P\) is unique for each worker. The choice of the subtree after a steal in lines 10 and 66 is changed in the same way – the detailed implementation of findWork for this and the other strategies is shown in the appendix. This strategy, which we call Assign, decreases the average tree size at \(P = 8\) to \(134\). Interestingly, we can do even better by applying this assignment only when the node depth is below \(\log _2 P\) and randomizing the traversal order otherwise. We call this strategy AssignTop – it decreases the average tree size at \(P = 8\) to \(77\). Building on the randomization idea, we introduce an additional strategy called RandomWalk, in which the traversal order in findWork is completely randomized. However, this results in a lower throughput and bigger tree sizes. Additionally randomizing the choice in lines 10 and 66 (RandomAll) is even less helpful, since the stealer and the victim clash more often.
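
The Assign rule itself is a one-liner; the small sketch below evaluates it and prints the distinct traversal preferences of eight workers over the first three tree levels, purely as an illustration of the formula above.

```scala
object AssignStrategy {
  // Traversal direction for the Assign strategy: worker i at tree depth l traverses
  // left-to-right iff ((i >> (l mod ceil(log2 P))) mod 2) == 1, making each worker's
  // initial root-to-leaf path unique up to depth log2(P).
  def leftToRight(workerIndex: Int, level: Int, numWorkers: Int): Boolean = {
    val bits = math.max(1, 32 - Integer.numberOfLeadingZeros(numWorkers - 1)) // ceil(log2 P)
    ((workerIndex >> (level % bits)) % 2) == 1
  }

  def main(args: Array[String]): Unit = {
    val P = 8
    for (w <- 0 until P) {
      val path = (0 until 3).map(l => if (leftToRight(w, l, P)) "L" else "R").mkString
      println(s"worker $w prefers: $path")   // eight distinct prefixes for eight workers
    }
  }
}
```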

Fig. 7. Comparison of findWork implementations

The results of the five strategies mentioned thus far lead to the following observation. If a randomized strategy like RandomWalk or AssignTop works better than a suboptimal strategy like Predefined, then some of its random choices are beneficial to the overall execution time and some are detrimental. So, there must exist an even better strategy which only makes the choices that lead to a better execution time. Rather than providing a theoretical background for such a strategy, we propose a particular one which seems intuitive. Workers traverse the entire tree and pick the node with the most work, and only then attempt to own or steal it. We call this strategy FindMax. Note that this cannot easily be implemented atomically, but a quiescently consistent implementation may still serve as a decent heuristic. This strategy yields an average tree size of \(42\) at \(P = 8\), as well as a slightly better throughput – we therefore choose it as our default strategy. The diagrams in Fig. 7 also reveal the postulated inverse correlation between the tree size and the total execution time, both for the Intel i7-2600 and for the Sun UltraSPARC T2 processor (where STEP is set to \(600\)); this is particularly noticeable for Assign when the total number of workers is not a power of two. For some \(P\), RandomAll works slightly better than FindMax on the UltraSPARC, but both are much more efficient than static batching, which deteriorates heavily once \(P\) exceeds the number of cores.

Fig. 8. Comparison of different kernel functions I (throughput/\(s^{-1}\) vs. #workers)

Fig. 9. Comparison of different kernel functions II (throughput/\(s^{-1}\) vs. #workers)

The results so far go a long way towards justifying that C1 is fulfilled. We focus on C2 and C3 next by changing the workloads, namely the kernel function. Figures 8 and 9 show a comparison of the work-stealing tree and the other schedulers on a range of different workloads. Each workload pattern is illustrated before its respective diagrams, along with corresponding real-world examples. To avoid memory access effects and additional layers of abstraction, each workload is minimal and synthetic, but corresponds to a practical use-case. To test C3, in Fig. 8-5, 6 we decrease the number of elements to \(16\) and increase the work per element heavily. Fixed-size batching fails utterly for these workloads – the total number of elements is on the order of, or well below, the estimated STEP. These workloads obviously require smaller STEP sizes to allow stealing, but that would annul the baseline performance, and the scheduler cannot distinguish the two cases. We address these seemingly incompatible requirements by modifying the work-stealing tree in the following way. A mutable step field is added to Node; it is initially \(1\) and does not require atomic access. At the end of the while loop in the workOn method, step is doubled unless it is greater than some value MAXSTEP. As a result, workers start processing each node by cautiously checking whether they can complete a small amount of work without being stolen from, and then increase the step exponentially. This naturally slows down the overall baseline execution, so we expect the MAXSTEP value to be greater than the previously established STEP. Indeed, on the i7-2600 we had to set MAXSTEP to \(256\) to maintain the baseline performance, and at \(P = 8\) even to \(1024\). With these modifications the work-stealing tree yields linear speedup for all uniform workloads.
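
A sketch of this modification, shown as a drop-in replacement for the workOn method of the earlier Worker sketch; the step is kept in a local variable rather than a Node field purely for brevity, and MAXSTEP = 256 is the value reported for the i7-2600.

```scala
// workOn with an exponentially growing batch size: start with single-element batches
// and double the step after every successfully processed batch, up to MAXSTEP.
private def workOn(ptr: Ptr): Boolean = {
  val MAXSTEP = 256
  val node = ptr.child.get
  var step = 1                                  // the mutable step field from the text, local here
  var from = node.progress.get
  var batch = tryAdvance(node, step)
  while (batch > 0) {
    kernel(from, from + batch)                  // process the reserved elements
    if (step < MAXSTEP) step *= 2               // exponential increase, capped at MAXSTEP
    from = node.progress.get
    batch = tryAdvance(node, step)
  }
  batch >= 0
}
```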

Triangular workloads, such as those shown in Fig. 8-8, 9, 10, show that static batching can yield suboptimal speedup due to its uniform-workload assumption. Figure 8-20 shows the inverse triangular workload and its negative effect on guided self-scheduling – the first-arriving processor takes the largest batch, which incidentally contains most of the work. We do not invert the other increasing workloads, but stress that it is neither helpful nor necessary to have batches above a certain size.

Figure 9-28 shows an exponentially increasing workload, where the work associated with the last element equals the rest of the work combined – the best possible speedup is \(2\). Figures 9-30 and 9-32 show two examples where a probability distribution dictates the workload, which occurs often in practice. Guided self-scheduling works well when the distribution is relatively uniform, but fails to achieve optimal speedup when only a few elements require more computation, for the reasons mentioned earlier.

In the STEP distributions all elements except those in some range \(\langle n_1, n_2 \rangle \) are associated with a very low amount of work. The range is set to \(25\,\%\) of the total number of elements. When its absolute size is above MAXSTEP, as in Fig. 9-34, most schedulers do equally well. However, not all schedulers achieve optimal speedup as we decrease the total number of elements \(N\) so that the range size drops below MAXSTEP. In Fig. 9-35 we set \(n_1 = 0\) and \(n_2 = 0.25 N\). Schedulers other than the work-stealing tree achieve almost no speedup, each for the same reasons as before. However, in Fig. 9-36, where we set \(n_1 = 0.75 N\) and \(n_2 = N\), we discover that the work-stealing tree achieves a suboptimal speedup. The reason is the exponential batch increase – the first worker acquires the root node and quickly processes the cheap elements, having increased the batch size to MAXSTEP by the time it reaches the expensive ones. The real work is thus claimed by the first worker and the others are unable to acquire it. Given that some batches are smaller and some larger, as explained above, this problem cannot be worked around by a different batching order – there always exists a workload distribution such that the expensive elements are in the largest batch. In this adversarial setting the existence of a suboptimal work distribution for every batching order can only be overcome by randomization. We omit the details for reasons of space, but briefly explain in the appendix how to randomize the batching and thereby improve the expected speedup.

Fig. 10. (A) Matrix multiplication and (B) Mandelbrot sets on i7 and UltraSPARC T2

Finally, we conclude this section by comparing the new scheduler with an existing scheduler implementation used in the Scala Parallel Collections [17] (Fig. 10). The Scala Parallel Collections scheduler is an example of an adaptive data-parallel scheduler relying on a task-parallel scheduler under the hood [15]. The batching order is chosen so that the batch sizes increase exponentially. At any point, the largest batch (task) is eligible for stealing – after a steal, the batch is divided according to the same batching order. Due to the overheads of preemptively creating batch tasks and scheduling them, the Scala Parallel Collections use a bound on the minimum batch size.

In Fig. 10 we evaluate the performance of the Scala Parallel Collections against the new scheduler on two benchmark applications – triangular matrix multiplication and Mandelbrot set computation. Triangular matrix multiplication has a linearly increasing workload. The Scala Parallel Collections scale as the number of processors increases on both the i7 and the UltraSPARC machine, although they are slower by a constant factor. However, in the Mandelbrot set benchmark, where we render the set in the part of the plane ranging from \((-2, -2)\) to \((32, 32)\), they do not scale beyond \(P = 2\) on the i7, and only start scaling after \(P = 16\) on the UltraSPARC. The reason is that the computationally expensive elements around the coordinates \((0, 0)\) end up in a single batch, and work on them cannot be parallelized. The work-stealing tree offers a more lightweight form of work-stealing with smaller batches and better load balancing.

5 Conclusion

We presented a scheduling algorithm for data-parallel operations that fulfills the specified criteria. Based on the experiments, we draw the following conclusions:

1. Minimum batch size on modern architectures needed to efficiently parallelize the sequential baseline typically ranges from a few dozen to several hundred elements.

2. There is no need to make batches larger than some architecture-specific size MAXSTEP, which is independent of the problem size – in fact, the approach employed by guided self-scheduling and factoring can be detrimental.

3. Batching can and should occur in isolation – by having workers communicate only when they run out of work, batching can be more fine-grained (Fig. 6).

4. Certain workloads require single-element batches, in which case the batch size has to be modified dynamically. Exponentially increasing the batch size from \(1\) up to MAXSTEP works well for different workloads (Fig. 9).

5. When the dominant part of the workload is distributed across a range of elements smaller than MAXSTEP, the worst-case speedup can be \(1\). Randomizing the batching order can improve the average speedup.

We hinted that the work-stealing tree serves as a reduction tree, and we show the details in the appendix, where we also give some theoretical background for the conclusions from the experiments. In this paper we focused on parallel loops, but arrays, hash tables and trees are also amenable to parallel traversal [3, 17, 18]. The range iterator state was encoded with a single integer, but the state of other data structure iterators, as well as their batching and stealing, may be more complex. While the CAS-based implementation of tryAdvance and trySteal ensures lock-freedom, the CAS instructions in those methods can be replaced with short critical sections for more complicated iterators – the work-stealing tree algorithm is thus potentially applicable to other data structures in a straightforward way.