1 Introduction

Recently, multicore and manycore computer architectures have become very attractive for achieving high-performance execution of scientific applications at relatively low cost [8, 17, 20]. Modern CPUs and accelerators now deliver performance that only recently required supercomputers. Unfortunately, adapting existing software to such new architectures can be difficult if we expect reasonable performance without putting much effort into software development. For example, the use of OpenCL [10] leads to a substantial increase in software complexity. However, high-level language-based programming interfaces devoted to parallel programming can sometimes yield satisfactory results with rather little effort [19].

The software development process for modern Intel multicore CPUs and manycore coprocessors like the Xeon Phi [8, 17] requires special optimization techniques to obtain codes that utilize the power of the underlying hardware. Usually it is not sufficient merely to parallelize applications: on such architectures, efficient vectorization is crucial for achieving satisfactory performance [8, 20]. Unfortunately, compiler-based automatic vectorization very often fails because of non-obvious data dependencies inside loops [1, 21]. On the other hand, people expect parallel programming to be easy; they prefer to concentrate on algorithms and to use simple and powerful programming constructs that can utilize the underlying hardware.

Cilk Plus introduces new extensions to the C/C++ programming languages to express task and data parallelism using high-level constructs [8, 9, 12, 18]. Although Cilk Plus is arguably easier to use than OpenMP [6], it is not very popular (several interesting applications can be found in [2, 3, 13, 15]).

In this paper we show that Cilk Plus can very easily be applied to parallelize the recursively defined adaptive Simpson's integration rule [11], and that such an implementation can easily be transformed to utilize coprocessors like the Intel Xeon Phi. We also show how to simplify the move from OpenMP to Cilk Plus and how to improve the performance of such algorithms by tuning data structures to the hardware (i.e. the vector units) of modern multicore and manycore processors. As an example we consider our Cilk Plus implementation of the Bellman-Ford algorithm for solving the single-source shortest-path problem [7], which achieves better performance than the corresponding simple OpenMP version of the algorithm. These two computational problems have been chosen to demonstrate the most important features of Cilk Plus, which can easily be added to sequential C/C++ programs.

2 Short Overview of Cilk Plus

Cilk Plus offers several powerful extensions to C/C++ that allow both task and data parallelism to be expressed [8, 17]. The most important constructs specify and handle the possible parallel execution of tasks (a short sketch follows the list):

  • cilk_for followed by the body of a for loop indicates that the iterations of the loop can be executed in parallel. The runtime applies the divide-and-conquer approach to schedule tasks among active workers, ensuring a balanced workload across the available cores.

  • cilk_spawn permits a given function to be executed asynchronously with the rest of the calling function.

  • cilk_sync tells that all tasks spawned in a function must complete before execution continues.
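
A minimal sketch of these keywords in use (the classic Fibonacci example and a simple loop; illustrative code, not a listing from this paper):

    #include <cilk/cilk.h>

    long fib(long n) {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);   /* may run on another worker */
        long y = fib(n - 2);              /* the current worker continues here */
        cilk_sync;                        /* wait for the spawned call */
        return x + y;
    }

    void axpy(int n, double a, const double *x, double *y) {
        cilk_for (int i = 0; i < n; i++)  /* iterations may run in parallel */
            y[i] += a * x[i];
    }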

Another important feature of Cilk Plus is the array notation, which introduces vectorized operations on arrays. The expression A[start:len:stride] represents an array section of length len starting from A[start] with the given stride; if the stride is omitted, it defaults to 1. The operator [:] can be used on both static and dynamic arrays. There are also several built-in functions that perform basic computations over the elements of an array section, such as sum, min and max. It should be noticed that the array notation can also be used for array indices. For example, A[x[0:len]] denotes the elements of the array A given by the indices from x[0:len].
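
For illustration, a few hedged examples of the notation (the arrays and the helper function are hypothetical):

    void arrayNotationDemo(const int *x /* 50 valid indices into A */) {
        double A[100], B[100], C[100];
        A[:] = 1.0;                            /* [:] spans a whole static array */
        B[:] = 2.0;
        C[0:50] = A[0:50] + B[0:50:2];         /* elementwise add, stride 2 on B */
        double s  = __sec_reduce_add(C[0:50]); /* built-in reduction: sum */
        double mn = __sec_reduce_min(C[0:50]); /* built-in reduction: minimum */
        C[50:50] = A[x[0:50]];                 /* gather: elements of A by index */
    }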

Intel Cilk Plus also supports Shared Virtual Memory, which allows data to be shared between the CPU and the coprocessor; this is especially promising for complex data structures [8, 17]. Such shared variables are declared using the _Cilk_shared keyword, which can also mark functions that should be available on both the CPU and the coprocessors. Computations can be offloaded to coprocessors for asynchronous execution using the _Cilk_spawn _Cilk_offload construct. In such a case all necessary data are moved to the coprocessor. Memory synchronization between the CPU and the coprocessor takes place when an offloaded function is called by the CPU or when an offloaded function returns (i.e. when cilk_sync is used). The description of other features of Cilk Plus (like reducers) can be found in [17].
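
The following hedged fragment illustrates these keywords (the names are hypothetical):

    _Cilk_shared int table[1000];           /* lives in Shared Virtual Memory */

    _Cilk_shared void fill(int v) {         /* callable on CPU and coprocessor */
        table[:] = v;
    }

    void run(void) {
        _Cilk_spawn _Cilk_offload fill(7);  /* asynchronous offload to Xeon Phi */
        /* ... the CPU can do other work here ... */
        _Cilk_sync;                         /* shared memory is synchronized here */
    }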

3 Two Examples of Computational Problems

Now we will present two exemplary problems that can easily be parallelized and optimized using Cilk Plus. All implementations have been tested on a server with two Intel Xeon E5-2670 v3 processors (24 cores in total, with hyperthreading, 2.3 GHz), 128 GB RAM, and an Intel Xeon Phi Coprocessor 7120P (61 cores with multithreading, 1.238 GHz, 16 GB RAM), running under CentOS 6.5 with Intel Parallel Studio 2017, whose C/C++ compiler supports Cilk Plus. Experiments on the Xeon Phi have been carried out using its native and offload modes.

3.1 Adaptive Simpson’s Integration Rule

Let us consider the following recursive method for numerical integration called Adaptive Simpson's Rule [11]. We want to find an approximation of

$$\begin{aligned} I(f) = \int _a^b f(x) dx \end{aligned}$$
(1)

with a user-specified tolerance \(\epsilon \). Let \(S(a,b)=\frac{h}{6}\left( f(a)+4f(c)+f(b)\right) \), where \(h=b-a\) and c is the midpoint of the interval \([a,b]\). The method applies Simpson's rule recursively to the halves of the interval until the following stopping criterion is satisfied [14]:

$$\begin{aligned} \frac{1}{15}|S(a,c)+S(c,b)-S(a,b)|<\epsilon . \end{aligned}$$
(2)

Figure 1 shows our parallel version of the straightforward recursive implementation of the method [4]. Note that we have only added the keywords _Cilk_spawn and _Cilk_sync. The first one specifies that cilkAdaptiveSimpsonsAux() can execute in parallel with the remainder of the calling kernel. _Cilk_sync indicates that all spawned calls in the current call of the kernel must complete before execution continues (a hedged sketch of such a kernel is given after the figure caption). For comparative purposes we have also implemented the method using OpenMP tasks [16], where the keywords _Cilk_spawn and _Cilk_sync are simply replaced with the task and taskwait constructs.

Fig. 1. Parallel version of Adaptive Simpson's method
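
Since the listing is not reproduced here, the following is a minimal hedged sketch of what such a kernel may look like: the kernel names, the integrand and the stopping criterion (2) follow the paper, while the argument lists, the halving of \(\epsilon \) and the Richardson correction term are standard choices of ours:

    #include <cilk/cilk.h>
    #include <math.h>

    static double f(double x) { return exp(x * x); } /* integrand used in the tests */

    static double S(double fa, double fc, double fb, double h) {
        return (h / 6.0) * (fa + 4.0 * fc + fb);     /* Simpson's rule, h = b - a */
    }

    double cilkAdaptiveSimpsonsAux(double a, double b, double eps, double whole,
                                   double fa, double fb, double fc, int depth) {
        double c = 0.5 * (a + b);
        double fd = f(0.5 * (a + c)), fe = f(0.5 * (c + b));
        double left  = S(fa, fd, fc, c - a);         /* S(a,c) */
        double right = S(fc, fe, fb, b - c);         /* S(c,b) */
        if (depth <= 0 || fabs(left + right - whole) < 15.0 * eps)
            return left + right + (left + right - whole) / 15.0;
        double l = cilk_spawn cilkAdaptiveSimpsonsAux(a, c, eps / 2.0, left,
                                                      fa, fc, fd, depth - 1);
        double r = cilkAdaptiveSimpsonsAux(c, b, eps / 2.0, right,
                                           fc, fb, fe, depth - 1);
        cilk_sync;                                   /* wait for the spawned half */
        return l + r;
    }

    double cilkAdaptiveSimpsons(double a, double b, double eps, int depth) {
        double c = 0.5 * (a + b);
        double fa = f(a), fb = f(b), fc = f(c);
        return cilkAdaptiveSimpsonsAux(a, b, eps, S(fa, fc, fb, b - a),
                                       fa, fb, fc, depth);
    }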

Another Cilk implementation of the method assumes that some computations can be offloaded to a coprocessor (i.e. a Xeon Phi, if available). The auxiliary kernel cilkAdaptiveSimpsonsAux() should be declared with the keyword _Cilk_shared, which makes the function available to both the CPU and the coprocessors. In the main kernel cilkAdaptiveSimpsonsOff(), the integration over the first half of the interval \([a,b]\) can be offloaded to the Xeon Phi using the _Cilk_spawn _Cilk_offload construct, while the rest is done by the CPU.
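
A hedged sketch of this variant (for brevity the auxiliary kernel is shown with a simplified signature; the actual code may differ):

    _Cilk_shared double cilkAdaptiveSimpsonsAux(double a, double b,
                                                double eps, int depth);

    double cilkAdaptiveSimpsonsOff(double a, double b, double eps, int depth) {
        double c = 0.5 * (a + b);
        /* integrate the first half on the Xeon Phi, asynchronously ... */
        double l = _Cilk_spawn _Cilk_offload
                       cilkAdaptiveSimpsonsAux(a, c, eps / 2.0, depth);
        /* ... while the CPU integrates the second half */
        double r = cilkAdaptiveSimpsonsAux(c, b, eps / 2.0, depth);
        _Cilk_sync;   /* results and memory are synchronized here */
        return l + r;
    }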

Table 1 shows the execution times of our three parallel implementations (namely OpenMP with tasks, Cilk, and Cilk with offload) applied to finding an approximation of \(\int _{-4.4}^{4.4}\exp (x^2)dx\) with \(\epsilon =10^{-7}\) and \(depth=40\). We can observe that cilkAdaptiveSimpsons() outperforms ompAdaptiveSimpsons() significantly (about four times faster on the CPU and three times on the Xeon Phi). It should be noticed that the execution time (in seconds) of the sequential version of the method is 62.8 on the CPU and 638.04 on the Xeon Phi. Thus, the speedup achieved by our Cilk implementation is 14.35 (CPU) and 70.66 (Xeon Phi), respectively.

Our non-offloaded Cilk version scales very well as the number of Cilk workers increases up to 24 on the CPU and 60 on the Xeon Phi, respectively, i.e. up to the number of physical cores. Further increasing the number of workers yields only marginal gains. For cilkAdaptiveSimpsonsOff(), we can observe that the shortest execution time is achieved with twelve workers; then the execution times of cilkAdaptiveSimpsonsAux() on the CPU and on the Xeon Phi, each working on one half of the interval, are approximately equal.
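
The number of workers can be controlled in the usual Cilk Plus way, e.g. via the CILK_NWORKERS environment variable or the runtime API (the paper does not show this code; the call below is the standard one):

    #include <cilk/cilk_api.h>

    int main(void) {
        __cilkrts_set_param("nworkers", "12");  /* e.g. twelve workers */
        /* ... run cilkAdaptiveSimpsonsOff() here ... */
        return 0;
    }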

Table 1. Execution time (s) of ompAdaptiveSimpsons(), cilkAdaptiveSimpsons() and cilkAdaptiveSimpsonsOff() for \(\int _{-4.4}^{4.4}\exp (x^2)dx\)

3.2 Bellman-Ford Algorithm for the Single-Source Shortest-Path Problem

Let \(G=(V,E)\) be a directed graph with n vertices labeled from 0 to \(n-1\) and m arcs \(\langle u,v\rangle \in E\), where \(u,v\in V\). Each arc has a weight \(w(u,v)\in \mathbf {R}\), and we assume \(w(u,v)=\infty \) when \(\langle u,v\rangle \not \in E\). For each path \(\langle v_0,v_1,\ldots ,v_p\rangle \) we define its length as \(\sum _{i=1}^p w( v_{i-1},v_{i})\). We also assume that G does not contain negative cycles. Let \(d(s,t)\) denote the length of the shortest path from s to t, or \(d(s,t)=\infty \) if there is no path from s to t.

Algorithm 1 is the well-known Bellman-Ford method for finding the lengths of the shortest paths from a given source \(s\in V\) to all other vertices [7].

Algorithm 1. Bellman-Ford method for the single-source shortest-path problem
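
The pseudocode listing is not reproduced here; the following standard formulation of the method is a reconstruction consistent with the line references used below (in particular, the relaxation loop occupies lines 7–9), not the paper's original listing:

     1: for each v ∈ V do
     2:     d(v) ← ∞
     3: end for
     4: d(s) ← 0
     5: for i ← 1 to n − 1 do
     6:     for each v ∈ V do
     7:         for each arc ⟨u, v⟩ ∈ E do
     8:             d(v) ← min{d(v), d(u) + w(u, v)}
     9:         end for
    10:     end for
    11: end for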

The most common basic implementations of the algorithm assume that a graph is represented as an array describing its vertices. Each vertex is described by an array containing information about its incoming arcs, where each arc is represented by its initial vertex and its weight. It is also necessary to store the lengths of the arrays describing the vertices. In order to parallelize such a basic implementation using OpenMP (see Fig. 2, left; a hedged sketch is given after the figure caption), we should notice that the entire algorithm should be placed within the parallel construct. Then the loops 7–13 and 18–26 can be parallelized using the for construct with the clause schedule(dynamic,ChS); iterations are thus divided into pieces of the size specified by the chunk size ChS, and these pieces are dynamically dispatched to threads. The assignment in line 4 needs to be a single task (i.e. defined by single). Moreover, we need two copies of the array D for storing the current and previous updates within each iteration of the loop 20–25. It should be noticed that this loop is automatically vectorized by the compiler. For the sake of simplicity, we also assume that the vertex labeled 0 is the source.

Fig. 2. Bellman-Ford algorithm implemented using OpenMP and Cilk Plus
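
Since the figure is not reproduced here, the following is a hedged sketch of the OpenMP variant (ompBF1) described above; the arc structure, the array names and the chunk size parameter are our assumptions, and the line numbering does not match the original listing:

    #include <omp.h>
    #include <stdlib.h>
    #include <float.h>

    typedef struct { int src; double w; } arc_t; /* incoming arc: initial vertex + weight */

    void ompBF1(int n, const int *deg, arc_t **in, double *D, int ChS) {
        double *Dprev = (double *)malloc(n * sizeof(double));
        #pragma omp parallel
        {
            #pragma omp for schedule(dynamic, ChS)
            for (int v = 0; v < n; v++) D[v] = DBL_MAX; /* "infinity" */
            #pragma omp single
            D[0] = 0.0;                        /* vertex 0 is the source */
            for (int k = 1; k < n; k++) {      /* n - 1 relaxation rounds */
                #pragma omp for schedule(dynamic, ChS)
                for (int v = 0; v < n; v++) Dprev[v] = D[v];
                #pragma omp for schedule(dynamic, ChS)
                for (int v = 0; v < n; v++) {
                    double best = Dprev[v];
                    for (int j = 0; j < deg[v]; j++) { /* auto-vectorized loop */
                        double cand = Dprev[in[v][j].src] + in[v][j].w;
                        if (cand < best) best = cand;
                    }
                    D[v] = best;
                }
            }
        }
        free(Dprev);
    }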

In our Cilk Plus implementation (see Fig. 2, right), the loops 7–13 and 18–26 are parallelized using the cilk_for construct. We also assume that each vertex of a given graph is represented by two arrays of the same size. The first one (i.e. inv), sorted in increasing order, contains the labels of the initial vertices of incoming arcs; the second one (i.e. inw) stores the weights of the corresponding arcs. Then (lines 21–24) we can simply vectorize the body of the loop using the built-in function __sec_reduce_min() to find the minimum over the elements given by the sum of the array inw and the elements of the array d1 selected by the indices from inv. This is a fine example of the use of the array notation.
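
A hedged sketch of the vectorized relaxation step (corresponding to the loop 18–26; the surrounding code and the names other than inv, inw and d1 are our assumptions):

    #include <cilk/cilk.h>

    /* one relaxation round: d2 receives the updated distances */
    void cilkBFround(int n, const int *deg, int **inv, double **inw,
                     const double *d1, double *d2) {
        cilk_for (int v = 0; v < n; v++) {
            double best = d1[v];
            int len = deg[v];
            if (len > 0) {
                /* minimum over all incoming arcs of inw[v][j] + d1[inv[v][j]],
                   written with the array notation and a built-in reduction */
                double m = __sec_reduce_min(inw[v][0:len] + d1[inv[v][0:len]]);
                if (m < best) best = m;
            }
            d2[v] = best;
        }
    }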

Table 2. Execution time (in seconds) of three implementations of Algorithm 1
Fig. 3. Speedup of the OpenMP and Cilk Plus implementations versus the non-parallelized basic version of the Bellman-Ford algorithm

Table 2 shows the results of experiments performed for the considered implementations of the Bellman-Ford algorithm, namely basic, ompBF1, ompBF2 and cilkBF. Note that ompBF2 is another OpenMP implementation that uses the same data layout as cilkBF. All results have been obtained for graphs generated randomly for a given number of vertices and a given maximum degree (i.e. the maximum number of incoming arcs). We can observe that the parallel implementations are much faster than the basic (i.e. non-parallelized) implementation of Algorithm 1. Usually ompBF2 is faster than ompBF1, and cilkBF outperforms ompBF1 and ompBF2 for larger and wider graphs. However, in the case of our OpenMP implementations, Table 2 shows the best results chosen from several runs with various values of ChS; thus, one can say that our OpenMP versions have been manually tuned. In the case of cilkBF, the runtime system has been responsible for load balancing.

We can observe that for sufficiently large graphs all parallel implementations utilize multiple cores, achieving reasonable speedup (see Fig. 3). Moreover, cilkBF outperforms both OpenMP versions significantly, especially on the Xeon Phi. This is the effect of the efficient and explicit vectorization of the loop 7–9 of Algorithm 1. On this architecture it is also important to vectorize sufficiently long loops; indeed, the speedup grows as the maximum degree (i.e. the length of the arrays inv and inw) grows.

It should be noticed that we have also tested another version of cilkBF that uses the _Cilk_spawn _Cilk_offload construct and in which all data structures are shared between the CPU and the coprocessor. Unfortunately, the need to synchronize Shared Virtual Memory at the end of each iteration (i.e. of the loop 16–28) leads to a very large increase in processing time, and our implementation with offloading is over \(10{\times }\) slower than cilkBF. However, Shared Virtual Memory is well suited to exchanging irregular data of limited size when explicit synchronization is not used frequently. Both sides (the CPU and the coprocessor) should operate on locally allocated memory. Local data can be kept persistent using more sophisticated techniques, i.e. the use of Cilk Plus together with #pragma offload, as sketched below.
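
A hedged illustration of this technique (the names are hypothetical; alloc_if/free_if are the standard clauses for keeping coprocessor buffers alive between offloads):

    #include <cilk/cilk.h>

    __attribute__((target(mic))) static double *buf;

    void offloadStep(double *host, int n, int first) {
        buf = host;
        /* allocate on the first offload, keep the buffer afterwards */
        #pragma offload target(mic:0) in(buf[0:n] : alloc_if(first) free_if(0))
        {
            cilk_for (int i = 0; i < n; i++)
                buf[i] *= 2.0;  /* Cilk Plus parallelism on the coprocessor */
        }
    }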

4 Conclusions and Future Work

We have shown that Cilk Plus can very easily be applied to parallelize recursively defined problems like the Adaptive Simpson's Integration Rule, and that such an implementation can easily be modified to utilize coprocessors like the Intel Xeon Phi. It is also easy to move from OpenMP to Cilk Plus and to improve the performance of such algorithms by tuning data structures to the hardware (i.e. the vector units) of modern multicore and manycore processors. For sufficiently large graphs, our Cilk implementation of the Bellman-Ford algorithm for solving the single-source shortest-path problem achieves significantly better performance than the corresponding OpenMP versions of the algorithm. Thus, Cilk Plus is a good choice for people who want to concentrate on algorithms and prefer simple high-level programming constructs to express parallelism. Of course, the use of OpenMP together with more advanced programming tools makes it possible to fine-tune programs for a particular architecture [17, 20]; however, this involves much greater effort.

In the future, we plan to implement some other important computational problems using Cilk Plus. It would also be interesting and important to find problems that can benefit from using Shared Virtual Memory.