
1 Introduction

OpenMP and MPI have become the standard tools for developing parallel programs on shared-memory and distributed-memory architectures respectively. Compared to MPI, OpenMP is easier to use, as its directives, clauses, and runtime functions let the compiler and runtime execute code in parallel and synchronize results automatically, whereas MPI requires programmers to do all of this by hand. Therefore, several efforts have been made to port OpenMP to distributed-memory architectures. However, excluding CAPE [7, 9, 18], no solution has successfully met the two following requirements: (1) being fully compliant with the OpenMP standard and (2) providing high performance. The most prominent approaches include the use of an SSI [15], SCASH [19], the use of the RC model [13], source-to-source translation to MPI [1, 5] or Global Array [12], and Cluster OpenMP [11].

Among all these solutions, the use of a Single System Image (SSI) is the most straightforward approach. An SSI includes a Distributed Shared Memory (DSM) to provide an abstracted shared-memory view over a physical distributed-memory architecture. The main advantage of this approach is its ability to easily provide a fully-compliant version of OpenMP: thanks to their shared-memory nature, OpenMP programs can easily be compiled and run as processes on the different computers of an SSI. However, as the shared memory is accessed through the network, the synchronization between the memories involves a significant overhead, which makes this approach hardly scalable. Some experiments [15] have shown that the larger the number of threads, the lower the performance. In order to reduce the execution-time overhead involved by the use of an SSI, other approaches have been proposed. For example, SCASH maps only the shared variables of the processes onto a shared-memory area attached to each process, the other variables being stored in a private memory, and the RC model relies on the relaxed-consistency memory model. However, these approaches have difficulty identifying the shared variables automatically. As a result, no fully-compliant implementation of OpenMP based on these approaches has been released so far. Other approaches perform a source-to-source translation of the OpenMP code into MPI code. This allows the generation of high-performance code on distributed-memory architectures, but not all OpenMP directives and constructs can be implemented this way. As yet another alternative, Cluster OpenMP, proposed by Intel, requires the use of additional directives of its own (i.e. not included in the OpenMP standard). Thus, it cannot be considered a fully-compliant implementation of the OpenMP standard either.

CAPE uses Discontinuous Incremental Checkpointing (DICKPT) [8] to implement the OpenMP fork-join model. The jobs of OpenMP work-sharing constructs are divided and distributed to slave nodes using checkpoints. At each slave node, these checkpoints are used to resume the execution. In addition, the results obtained after executing the divided jobs on each slave node are also extracted using checkpoints and sent back to the master. It has been demonstrated that this solution is fully compliant with OpenMP and provides high performance. However, it has some limitations:

  • To run on top of CAPE, an OpenMP program must satisfy Bernstein's conditions. This is the reason why the matrix-matrix product has been used extensively in the previous experiments.

  • The implementation of CAPE wastes resources. In the implementation of OpenMP work-sharing constructs on CAPE, the master does not perform any part of the computation; it only waits for the checkpoint results from the slave nodes and merges them together.

  • There is a risk of bottlenecks and low communication performance in the implementation of the join phase. After executing the divided jobs, each slave node extracts a result checkpoint and sends it back to the master. The master receives the checkpoints, merges them together, and sends the result back to the slave nodes in order to synchronize data.

This paper presents the design and implementation of a new model for CAPE based on Time-stamp Incremental Checkpointing (TICKPT) [24] that overcomes the drawbacks mentioned above. The new implementation based on TICKPT improves the performance, capability, and reliability of this solution.

2 Checkpoint Techniques

2.1 Checkpointing

Checkpointing is a technique that saves the image of a process at a given point of its lifetime and allows it to be resumed from that point later if necessary [4, 17]. Using checkpointing, processes can resume their execution from a checkpointed state when a failure occurs, so there is no need to spend time re-initializing and re-executing them from the beginning. These techniques were introduced more than two decades ago. Nowadays, they are widely used for fault tolerance, application tracing/debugging, roll-back/animated playback, and process migration. To be able to save and resume the state of a process, the checkpoint saves all the necessary information at the time it is taken. This can include register values, the process's address space, the status of open files/pipes/sockets, the current working directory, signal handlers, process identities, etc. The process's address space consists of the text, data, mmap'ed memory areas, shared libraries, heap, and stack segments. Depending on the kind of checkpoint and its application, the checkpoint saves all or some of this information.

Based on the structure and contents of the checkpoint file, checkpointing techniques are categorized into two groups: complete and incremental checkpointing.

  • Complete checkpointing [3, 4, 14] saves all the information regarding the process at each point where a checkpoint is generated. The advantage of this technique is the reduction of the generation and restoration time. However, a lot of duplicated data is stored each time a checkpoint is taken, and there are also duplications across the different generated checkpoints.

  • Incremental checkpointing [8, 10, 17] only saves the data modified since the previous checkpoint. This technique reduces the checkpoint overhead and the checkpoint size. Therefore, it is widely used in distributed computing. This idea is illustrated by the sketch below.
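The following minimal C sketch illustrates the incremental idea: only the bytes that differ from the copy taken at the previous checkpoint are recorded. All names are hypothetical and the "checkpoint" is simply printed for illustration.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical illustration of incremental checkpointing: compare the
     * current state of a monitored buffer with the copy taken at the previous
     * checkpoint and record only the bytes that changed. */
    #define REGION_SIZE 16

    static unsigned char previous[REGION_SIZE];   /* state at the last checkpoint */

    static void incremental_checkpoint(const unsigned char *current)
    {
        for (int i = 0; i < REGION_SIZE; i++) {
            if (current[i] != previous[i]) {
                /* A real checkpointer would append (address, value) to a file. */
                printf("save offset %d: %d -> %d\n", i, previous[i], current[i]);
            }
        }
        memcpy(previous, current, REGION_SIZE);   /* baseline for the next checkpoint */
    }

    int main(void)
    {
        unsigned char region[REGION_SIZE] = {0};
        incremental_checkpoint(region);   /* first checkpoint: nothing has changed yet */
        region[3] = 42;                   /* the process modifies some data */
        incremental_checkpoint(region);   /* only offset 3 is saved */
        return 0;
    }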

2.2 Time-Stamp Incremental Checkpointing

Time-stamp Incremental Checkpointing (TICKPT) [24] is an improvement over DICKPT obtained by adding a new element – the time-stamp – to incremental checkpoints, and by removing unnecessary data based on the data-sharing attributes of the variables of OpenMP programs.

Basically, a TICKPT checkpoint contains three mandatory elements: the register information, the modified memory regions of the process, and their time-stamp. As in DICKPT, the register information is extracted from all the registers of the process. However, the time-stamp is added to identify the order of the checkpoints in the program. This helps reduce the time required to merge checkpoints and to select the right element when several are located at the same place in memory. In addition, only the modified data of shared variables are detected and saved into checkpoints. This significantly reduces the checkpoint size, the gain depending on the size of the private variables of the OpenMP program. A possible layout is sketched below.
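As an illustration of these three elements, the following C sketch shows one possible layout of a TICKPT checkpoint. The structure and field names are assumptions for illustration (the register type is the Linux-specific struct user_regs_struct); the real TICKPT format may differ.

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/user.h>   /* struct user_regs_struct (Linux-specific assumption) */

    /* Hypothetical layout of a TICKPT checkpoint, following the three mandatory
     * elements described above: register information, the modified memory
     * regions of the process, and a time-stamp ordering the checkpoints. */
    typedef struct {
        void    *address;    /* start address of a modified region           */
        size_t   size;       /* number of modified bytes                     */
        uint8_t *data;       /* new content of the region (shared variables) */
    } tickpt_region;

    typedef struct {
        struct user_regs_struct regs;      /* register values of the process      */
        tickpt_region          *regions;   /* modified regions, shared data only  */
        size_t                  nregions;
        uint64_t                timestamp; /* order of this checkpoint in the run */
    } tickpt_checkpoint;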

Fig. 1. New abstract model for CAPE.

3 CAPE Based on TICKPT

3.1 Abstract Model

Figure 1 presents the new abstract model for CAPE. It is designed based on TICKPT and uses MPI to transfer data over the network.

As presented for the previous version [21, 22], CAPE provides a set of prototypes to translate OpenMP code into CAPE code: OpenMP directives in C or C++ are replaced by sets of calls to CAPE runtime functions. In this version, the CAPE translator prototypes are modified and extended to match the new mechanism based on TICKPT. They cover the common constructs, clauses, and runtime functions of OpenMP.

Regarding the CAPE runtime library, apart from providing functions to handle OpenMP instructions and to port them to distributed-memory systems, some functions have been added to manage the declaration of variables and the allocation of memory on the heap. To transfer data among the nodes of the system, instead of using the socket-based functions of the previous version, the MPI_Send and MPI_Recv functions are called to ensure high reliability. A sketch of this kind of transfer is given below.
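The following is a minimal sketch of how a serialized checkpoint buffer might be exchanged with MPI_Send and MPI_Recv. The function names, the tag, and the two-message protocol (size first, then payload) are assumptions for illustration, not the actual CAPE code.

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical sketch of moving a serialized checkpoint buffer between two
     * nodes with MPI_Send/MPI_Recv instead of raw sockets. */
    enum { CKPT_TAG = 100 };

    void send_checkpoint(void *buf, int nbytes, int dest)
    {
        MPI_Send(&nbytes, 1, MPI_INT, dest, CKPT_TAG, MPI_COMM_WORLD);
        MPI_Send(buf, nbytes, MPI_BYTE, dest, CKPT_TAG, MPI_COMM_WORLD);
    }

    void *recv_checkpoint(int src, int *nbytes)
    {
        MPI_Recv(nbytes, 1, MPI_INT, src, CKPT_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        void *buf = malloc((size_t)*nbytes);
        MPI_Recv(buf, *nbytes, MPI_BYTE, src, CKPT_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        return buf;
    }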

Fig. 2. cape_flush() implementation.

3.2 RC-Model Based CAPE Memory Model Implementation

OpenMP uses the Relaxed Consistency (RC) memory model. This model allows shared data to be kept in the local memory of a thread to improve memory accesses. When a synchronization point is reached, this local memory is updated into the shared-memory area that can be accessed by all threads.

CAPE completely implements the RC model of OpenMP on distributed-memory systems. All variables, both private and shared, are stored on every node of the system, and they can only be accessed locally. At synchronization points, only the modified data of the shared variables on each node are extracted and saved into a checkpoint. This checkpoint is sent to the other nodes of the system and is merged with the others using the checkpoint merging operation. Then, the resulting checkpoint is injected into the application memory to synchronize data.

In the CAPE runtime library, there are two fundamental functions which are called implicitly at synchronization points:

  • cape_flush() generates a TICKPT checkpoint, then gathers, merges, and injects the checkpoints into the application memory. This function is described by the pseudo code in Fig. 2. Here, the all_reduce() function is responsible for gathering and merging the checkpoints generated by the generate_checkpoint() function. The gathering and merging are implemented using both the Ring and the Recursive Doubling algorithms; the algorithm to be executed is automatically selected by the system depending on the size of the checkpoint. A minimal sketch is given after this list.

  • cape_barrier() sets a barrier and updates the shared data between nodes. This function calls MPI_Barrier() from the MPI runtime library, and then uses cape_flush() to update the shared data.
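The following is a minimal C sketch of cape_flush(), following the description above and the pseudo code of Fig. 2. generate_checkpoint() and all_reduce() are named in the text; the other helpers, the size threshold, and the selection policy (Recursive Doubling for small checkpoints, Ring for large ones) are assumptions for illustration.

    #include <stddef.h>

    typedef struct tickpt_checkpoint tickpt_checkpoint;  /* opaque here; see the sketch in Sect. 2.2 */
    typedef enum { RING, RECURSIVE_DOUBLING } allreduce_algo;

    extern tickpt_checkpoint *generate_checkpoint(void);                  /* local TICKPT      */
    extern size_t             checkpoint_size(const tickpt_checkpoint *); /* size in bytes     */
    extern tickpt_checkpoint *all_reduce(tickpt_checkpoint *, allreduce_algo);
    extern void               inject(const tickpt_checkpoint *);          /* update app memory */

    #define LARGE_CHECKPOINT (1u << 20)   /* hypothetical threshold: 1 MiB */

    void cape_flush(void)
    {
        tickpt_checkpoint *local = generate_checkpoint();     /* modified shared data only */
        allreduce_algo algo = (checkpoint_size(local) < LARGE_CHECKPOINT)
                                  ? RECURSIVE_DOUBLING        /* assumed for small checkpoints */
                                  : RING;                     /* assumed for large checkpoints */
        tickpt_checkpoint *merged = all_reduce(local, algo);  /* gather and merge on all nodes */
        inject(merged);                 /* write the merged shared data back into application memory */
    }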

3.3 Execution Model

Figure 3 illustrates the execution model of CAPE. The idea of this model is to use TICKPT to identify and synchronize the modified data of the shared variables of the program among the nodes. OpenMP threads are replaced by processes, and each process runs on a node. At the beginning, the program is initialized and executed at the same time on all the nodes of the system. Then, the execution follows these rules:

Fig. 3. The new execution model of CAPE.

  • Sequential regions, as well as the code inside a parallel construct that does not belong to any other construct, are executed in the same way on all nodes.

  • When the program reaches a parallel region, on each node, CAPE detects and saves the properties of all the variables that are implicitly declared as shared. If any OpenMP clauses are declared in the parallel construct, the relevant runtime functions are called to modify the variable properties. Then, the start directive of TICKPT is called to save the values of the shared variables.

  • At the end of a parallel region, the implementation of the barrier construct is implicitly called to synchronize data, and the stop directive of TICKPT is called to remove all the relevant data.

  • For the loop construct, each node (including the master node) is responsible for computing a part of the work based on a re-calculation of the range of iterations (a minimal sketch of such a re-calculation is given after this list).

  • For the sections construct, each node is assigned one or more of the parts of work indicated by the section constructs.

  • At the barrier, the implementation of the flush construct is called to synchronize data.

  • When the program reaches a flush construct, a TICKPT checkpoint is generated and synchronized among the nodes to propagate the modifications of the shared data. According to [16], a flush is implicit at the following locations:

    • At the barrier.

    • At the entry to and the exit from parallel, critical, and atomic constructs.

    • At the exit from for, sections, and single constructs unless a nowait clause is present.
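As announced in the loop-construct rule above, the following sketch shows how each node could re-calculate its own range of iterations with a static block distribution. The helper name and the distribution policy are assumptions for illustration; the actual CAPE computation may differ.

    #include <mpi.h>

    /* Static block distribution of the iterations [first, last] over the nodes:
     * each node computes its own sub-range from its rank and the cluster size. */
    void loop_range(long first, long last, long *my_first, long *my_last)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long n     = last - first + 1;   /* total number of iterations          */
        long chunk = n / size;           /* base chunk per node                 */
        long rest  = n % size;           /* the first `rest` nodes get one more */

        *my_first = first + rank * chunk + (rank < rest ? rank : rest);
        *my_last  = *my_first + chunk - 1 + (rank < rest ? 1 : 0);
    }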

In this execution model, instead of having the master node divide the jobs and distribute them to the slave nodes using incremental checkpoints in order to implement OpenMP work-sharing constructs, each node computes its own share of the jobs automatically. At synchronization points, a TICKPT checkpoint is generated on each node. It contains the modified data of the shared variables and their time-stamps after executing the divided jobs. These checkpoints are gathered and merged on all the nodes of the system using the Ring or the Recursive Doubling algorithm [20]. This allows CAPE to avoid the bottleneck and improves the performance of the communication tasks.

Thanks to the features of TICKPT, checkpoints can be combined using checkpoint operations [23, 24]. These operations handle memory elements that share the same address and make merging simple. Therefore, CAPE can work without requiring the program to satisfy Bernstein's conditions. Moreover, the master node takes part in the computation of the divided jobs, which uses all the resources and improves system efficiency. A minimal sketch of such a merge is given below.
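As a minimal illustration of such a checkpoint operation, the sketch below merges two elements located at the same address by keeping the one with the most recent time-stamp. The types, names, and the "latest wins" rule are assumptions for illustration, not the actual TICKPT operators of [23, 24].

    #include <stdint.h>

    /* Hypothetical checkpoint element: one modified memory location together
     * with the time-stamp of the checkpoint it belongs to. */
    typedef struct {
        void     *address;     /* location of the modified data      */
        long      value;       /* new value at that location         */
        uint64_t  timestamp;   /* order of the generating checkpoint */
    } ckpt_element;

    /* When two checkpoints contain an element at the same address, keep the one
     * with the most recent time-stamp (assumed rule, used here for illustration). */
    ckpt_element merge_elements(ckpt_element a, ckpt_element b)
    {
        return (a.timestamp >= b.timestamp) ? a : b;
    }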

The only part of the OpenMP specifications missing from this implementation is the dynamic and guided scheduling of work-sharing constructs, which has not been implemented yet. However, one can demonstrate that these can easily be translated into static scheduling.

3.4 Prototypes

To be executed on a distributed-memory system with the support of the CAPE runtime library, the OpenMP source code is translated into CAPE source code, in which each construct, clause, and runtime function of the OpenMP code is translated into the relevant CAPE runtime functions. This translation is driven by a set of CAPE prototypes.

Based on the general syntax of OpenMP directives, a general template for the CAPE prototypes was designed; it is illustrated in Fig. 4. Its components are as follows:

Fig. 4. General template for CAPE prototypes in C/C++.

  • cape_begin() and cape_end() are CAPE runtime functions that perform the actions for entering and exiting OpenMP directives. The directive-name is a label declared by CAPE which corresponds to the relevant CAPE runtime function; depending on this label, the cape_barrier() function is called to update the shared data of the system. param-1 and param-2 store the range of iterations for for loops; otherwise they are both set to zero. The reduction-flag is set to TRUE if an OpenMP reduction clause is declared, and to FALSE otherwise.

  • cape_clause_functions is a set of CAPE runtime functions used to implement OpenMP clauses. This implementation is presented in [23].

  • ckpt_start() marks the location where checkpointing starts. When the ckpt_start() function is reached, the values of the shared variables are copied. A hypothetical example of a translation following this template is given after this list.
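To illustrate how this template could be applied, the following hypothetical example shows a simple OpenMP parallel for loop and a possible translation into CAPE calls. The directive label, the FALSE constant, and the range helper are assumptions; the exact prototypes generated by the real CAPE translator may differ.

    /* Original OpenMP code:
     *     #pragma omp parallel for
     *     for (long i = 0; i < n; i++) a[i] = b[i] + c[i];
     */
    extern void cape_begin(int directive, long param1, long param2, int reduction_flag);
    extern void cape_end(int directive);
    extern void ckpt_start(void);
    extern void loop_range(long, long, long *, long *);  /* see the sketch in Sect. 3.3 */

    #define CAPE_PARALLEL_FOR 1   /* hypothetical directive label */
    #define FALSE 0

    void vector_add_cape(double *a, const double *b, const double *c, long n)
    {
        long first, last;
        cape_begin(CAPE_PARALLEL_FOR, 0, n - 1, FALSE); /* enter directive, iteration range, no reduction */
        ckpt_start();                                   /* copy the values of the shared variables */
        loop_range(0, n - 1, &first, &last);            /* part of the work assigned to this node  */
        for (long i = first; i <= last; i++)
            a[i] = b[i] + c[i];
        cape_end(CAPE_PARALLEL_FOR);                    /* exit directive: implicit barrier and flush */
    }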

4 Experiments

In order to evaluate the performance of this new approach, we designed a set of micro-benchmarks and tested them on a desktop cluster. The programs are based on the Microbenchmark for OpenMP 2.0 [2, 6]. They have been translated to CAPE and executed on the cluster to compare the performance.

Fig. 5. OpenMP function to compute vectors using sections construct.

4.1 Benchmarks

(1) MAMULT2D: This program computes the product of two matrices. Originally, it was written in C/C++ and uses the OpenMP parallel for construct. It satisfies Bernstein's conditions; therefore, it has been used extensively to test CAPE in previous works.

(2) PRIME: This program counts the number of prime numbers in the range from 1 to N. The OpenMP code uses the parallel for construct with data-sharing clauses.

(3) PI: This program computes the value of PI by means of numerical integration, using Eq. (1). A minimal OpenMP sketch of this computation is given after this list.

$$\begin{aligned} \pi = \int _{0}^{1} \frac{4}{1+ x^{2}} dx \end{aligned}$$
(1)

(4) VECTOR-1: This program performs operations on vectors. It contains OpenMP runtime functions, data-sharing clauses, a nowait clause, and parallel and sections constructs. The OpenMP code is presented in Fig. 5.

(5) VECTOR-2: This program performs some operations on vectors. It contains OpenMP parallel and for constructs with a nowait clause. The OpenMP code is shown in Fig. 6.
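The following is a minimal OpenMP sketch of the PI computation described in item (3), using a parallel for with a reduction clause; it is an illustration, not the exact benchmark source.

    #include <stdio.h>

    /* Numerical integration of 4/(1+x^2) over [0,1] with an OpenMP reduction. */
    static double compute_pi(long nsteps)
    {
        double step = 1.0 / (double)nsteps;
        double sum  = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < nsteps; i++) {
            double x = (i + 0.5) * step;        /* midpoint of the i-th interval */
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * step;
    }

    int main(void)
    {
        printf("pi ~= %.12f\n", compute_pi(100000000L));   /* 10^8 steps, as in Sect. 4.3 */
        return 0;
    }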

Fig. 6. OpenMP function to compute vectors using for construct.

4.2 Experimental Environment

The experiments have been performed on a 16-node cluster made of computers with different configurations: two computers with an Intel(R) Pentium(R) Dual CPU E2160 at 1.80 GHz, 2 GB of RAM, and 5 GB of free HDD; seven computers with an Intel(R) Core(TM)2 Duo CPU E7300 at 2.66 GHz, 3 GB of RAM, and 6 GB of free HDD; five computers with an Intel(R) Core(TM) i3-2120 CPU at 3.30 GHz, 8 GB of RAM, and 6 GB of free HDD; and two computers with an AMD Phenom(TM) II X4 925 processor at 2.80 GHz, 2 GB of RAM, and 6 GB of free HDD. All machines run the Ubuntu 14.04 LTS operating system with an OpenSSH server and MPICH-2, and are interconnected by a 100 Mbps LAN.

Fig. 7. Execution time (in milliseconds) of MAMULT2D for different matrix sizes on a 16-node cluster.

4.3 Experimental Results

Figures 7 and 8 present the execution time in milliseconds of the MAMULT2D program for various matrix sizes and various cluster sizes respectively. Note that the nodes have different kinds of processors; some of them have several cores, but a single core was used on each node during the experiments. Three measures are presented each time: the left one (yellow) for CAPE-DICKPT (the previous version), the middle one (blue) for CAPE-TICKPT (the current version), and the right one (red) for MPI.

Figure 7 presents the execution time for various matrix sizes on a 16-node cluster. The size increases from 800x800 to 6400x6400. The figure shows that the execution times of all methods increase with the matrix size. It also shows that the execution time of CAPE-TICKPT is much lower than the one of CAPE-DICKPT (around 35%), while the execution times of CAPE-TICKPT and MPI are roughly equal.

Fig. 8. Execution time (in milliseconds) of MAMULT2D for different cluster sizes.

Figure 8 presents the execution time for a matrix size of 6400x6400 on different cluster sizes. The number of nodes is successively 4, 8, and 16. The results in this figure show the same trend as for the different matrix sizes: the execution time of CAPE is significantly reduced, so that it is now much closer to that of an optimized, hand-written MPI program.

To demonstrate that the new version of CAPE can run OpenMP programs that do not satisfy Bernstein's conditions while still achieving high performance, further experiments were conducted and their performance was compared with MPI. The four other programs presented in Sect. 4.1 have been used to measure the execution time.

Figure 9 presents the execution time in milliseconds of PRIME with \(N = 10^6\) on different cluster sizes for CAPE-TICKPT and MPI. It shows that the execution time of MPI is only around 1% smaller than the one of CAPE-TICKPT. In this experiment, the OpenMP parallel for directive with the shared, private and reduction clauses is translated and tested for both methods. Table 1 describes the steps executed by the program for both methods. The main difference lies in the join phase, which gathers the results from all nodes and computes their sum. In the MPI program, the user needs to explicitly specify the values that have to be gathered, and then call the MPI_Reduce() function to compute the sum, as sketched below. CAPE-TICKPT automatically identifies the modified values of the shared variables, extracts them into a TICKPT checkpoint, and then gathers the checkpoints from all the nodes with the checkpoint merging operator. As the execution time of CAPE-TICKPT is nearly equal to the one of MPI, we consider that CAPE successfully achieves high performance.
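As a minimal illustration of this MPI join phase, the sketch below sums the local prime counts on the root with MPI_Reduce(); the function and variable names are illustrative.

    #include <mpi.h>

    /* Each node has counted the primes in its own sub-range; MPI_Reduce()
     * sums the local counts into `total` on rank 0. */
    long join_prime_counts(long local_count)
    {
        long total = 0;
        MPI_Reduce(&local_count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        return total;   /* meaningful on rank 0 only */
    }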

Table 1. Comparison of the executed steps for the PRIME code for both CAPE-TICKPT and MPI.

Figure 10 presents the execution time in milliseconds of PI with \(10^8\) steps for different cluster sizes using CAPE-TICKPT and MPI. In this experiment, the OpenMP for directive with a reduction clause placed inside an omp parallel construct with some clauses is tested. As in the previous experiments, this figure shows that CAPE-TICKPT achieves performance similar to MPI.

Figure 11 shows the execution time in milliseconds of the VECTOR-1 program with \(N = 10^6\) for different cluster sizes using CAPE-TICKPT and MPI. In this experiment, OpenMP runtime functions and the sections construct with two section directives are tested. The figure shows that the larger the number of nodes, the longer the execution time for both methods. The execution time with MPI is smaller than the one of CAPE-TICKPT, but the difference is not significant. Note that there are only two section directives in this program, so both CAPE-TICKPT and MPI distribute the execution to two nodes only: each of these nodes receives and executes the code of one section. However, the result has to be synchronized to all the nodes of the system; therefore, the execution time increases when the number of nodes increases.

Fig. 9. Execution time (in milliseconds) of PRIME on different cluster sizes.

Fig. 10. Execution time (in milliseconds) of PI on different cluster sizes.

Fig. 11. Execution time (in milliseconds) of VECTOR-1 on different cluster sizes.

Fig. 12. Execution time (in milliseconds) of VECTOR-2 on different cluster sizes.

Figure 12 shows the execution time in milliseconds of VECTOR-2 with \(N = 10^6\) and \(M = 1.6 \times 10^6\) on different cluster sizes for both CAPE-TICKPT and MPI. This experiment tests two omp for directives with a nowait clause. The sizes of the two vectors are different to ensure that the nodes take different amounts of time to execute the divided jobs. The execution on each node is not synchronized until it reaches the end of the parallel region. The figure shows the same trend as the previous experiments: the execution time of CAPE-TICKPT is very close to the one of MPI, the difference being negligible.

5 Conclusion and Future Works

This paper presented the design and implementation of a new execution model and new prototypes for CAPE based on TICKPT. With this new capability, CAPE improves its reliability and can run OpenMP programs that do not satisfy Bernstein's conditions. In addition, the performance analysis and evaluation presented in this paper demonstrate that CAPE-TICKPT achieves performance very close to a comparable hand-optimized MPI program. This is mainly because CAPE-TICKPT benefits from the advantages of TICKPT, such as the checkpoint operators, and can use resources more efficiently. The synchronization phase of the new execution model also avoids the risk of bottlenecks that could occur in the previous version.

In the near future, based on this mechanism, we will keep developing the CAPE framework in order to support other OpenMP constructs. Furthermore, we plan to extend CAPE to GPUs.