
1 Introduction

OpenACC is an open standard that defines a collection of compiler directives, or pragmas, for executing code blocks on accelerators such as GPUs or Xeon Phi coprocessors. OpenACC aims to reduce both the learning time and the effort required to parallelize sequential code in a portable way [1]. The OpenACC specification is currently at version 2.5 [2], which was released recently.
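To give a flavour of the programming model, the following is a minimal sketch of a vector addition annotated with OpenACC in C. The code is illustrative and not taken from any of the benchmarks discussed below:

    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        /* The compiler generates a device kernel for this loop and inserts
           the host/device transfers implied by the data clauses. */
        #pragma acc parallel loop copyin(a, b) copyout(c)
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %f\n", c[42]);
        return 0;
    }

Built without accelerator support, the same source compiles as plain sequential C, which is the portability argument behind the directive-based approach.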

OpenACC was founded by Nvidia, CRAY, CAPS, and PGI, but the consortium now has a long list of members from both industry and academia, including the Oak Ridge National Laboratory, the University of Houston, AMD, and the Edinburgh Parallel Computing Centre (EPCC), among others. At the time of writing, the Corporate Officers come from Nvidia, Oak Ridge National Laboratory, CRAY, and AMD. Academic memberships are available to interested institutions.

Several compilers support OpenACC. The PGI Compiler (from The Portland Group, which has been a subsidiary of Nvidia for some time now) is distributed as part of the Nvidia OpenACC Toolkit, under a free 90-day trial license, a free annual academic license, or a commercial license. The PGI Compiler uses CUDA or OpenCL as its backend. CAPS enterprise, a provider of software and services for the High Performance Computing community, also developed a compiler supporting CUDA and OpenCL; however, the company is no longer in business and its compiler is no longer available. CRAY Inc. has its own OpenACC compiler, available with their computers, and it is reported to be one of the most mature commercial compilers. However, they do not yet offer an academic or trial license suitable for studies like this one. Pathscale Inc., a compiler and multicore software developer, also implemented OpenACC in their ENZO compiler. Unfortunately, in private communications they declined to let us use their compiler for this study.

There have also been several academic efforts to develop an OpenACC compiler, in particular OpenUH [3], developed at the University of Houston, and accULL [4], from Universidad de La Laguna (Spain). Both are freely available to anyone interested.

This work presents a study of the level of OpenACC support in the available compilers, examining their strengths and weaknesses and giving insights into their performance.

We analyze the level of maturity of each compiler in terms of its robustness and the completeness of its support for the standard, using each compiler's documentation to check which parts of the specification have been implemented. We check the support for OpenACC compiler directives with the help of a benchmark suite developed by the Edinburgh Parallel Computing Centre, described later.

Another aspect to test is the relative performance of the generated code. For this, we want to run more complex applications and measure the differences in performance between the executables generated by each compiler. Ideally we would test applications as close to real-world problems as possible, avoiding synthetic code fragments. Since the use of OpenACC is not yet widespread, we have to rely on existing benchmarks. Comparing the results of OpenACC code with direct CUDA or OpenCL implementations might seem appropriate at this point, but porting the different benchmarks to those languages would make the results dependent on human interference, as the developer's CUDA programming ability impacts performance.

The experiments in this work were carried out using several benchmarks. First, the EPCC benchmark suite, which contains a group of 13 kernels ported to OpenACC, called “Level 1” benchmarks, as well as three real applications called “Himeno”, “27stencil”, and “le_core” [5]. We have also used the Pathscale port of the Rodinia benchmark suite [6]. We would also have liked to use the OpenACC Validation Testsuite [7] developed by the University of Houston, but at the moment that tool is only available to OpenACC members.

Our conclusion is that the different compilers are on their way to a reasonable level of maturity. However, a number of features are not yet fully implemented by some of the compilers.

The rest of this paper is organized as follows. Section 2 describes the selected compilers. Section 3 presents the characteristics of the chosen benchmark suites, enumerating some problems encountered when compiling them with the selected compilers. Section 4 contains the results of our analysis, in terms of completeness of the OpenACC features supported, robustness of the compiler implementations, and relative performance of the generated code. Finally, Sect. 5 concludes the paper.

2 Available Compilers

In the introduction we mentioned several compilers. In this section we describe in more detail the compilers we were able to obtain and use for this study, and we discuss the particularities of their installation on our Linux-based platform.

2.1 PGI Compiler

The PGI Compiler [8] is developed by The Portland Group, which is owned by Nvidia. This compiler is widely used in webinars, workshops, and conferences.

At the time of writing, the PGI Compiler is available for download as part of the OpenACC Toolkit from Nvidia, under a free 90-day trial, a one-year academic license, or a commercial license.

2.2 accULL

The accULL [4] compiler, developed by Universidad de La Laguna (Spain), is an open-source initiative. accULL consists of two layers: YaCF [9] (Yet another Compiler Framework) and Frangollo [10], a runtime library. YaCF acts as a source-to-source translator, while Frangollo provides an interface to the most common operations found in accelerators.

2.3 OpenUH

The OpenUH [3] compiler, developed by the University of Houston (USA), is another open-source initiative. It builds on Open64, a discontinued open-source optimizing compiler.

3 Benchmark Description

This section describes the benchmarks used in our work, enumerating the main characteristics that make them interesting for this study and any issues detected during their compilation with the three compilers under study.

3.1 EPCC OpenACC Benchmarks

This benchmark suite [11] has been developed by the Edinburgh Parallel Computing Centre (EPCC). The benchmarks are divided into three categories: “Level 0”, “Level 1”, and “Applications”. The compiled program launches all the benchmarks in the suite sequentially. By default, the number of repetitions is ten, and the reported result is the average for each benchmark. Time is measured in microseconds in double precision, using the OpenMP function omp_get_wtime(). We briefly describe the included benchmarks below (a minimal sketch of the timing methodology follows the level descriptions):

Level 0. Level 0 includes a collection of small benchmarks that execute single host and accelerator operations, such as memory transfers.

Level 1. Level 1 benchmarks [12] consist of a series of BLAS-type kernels based on Polybench [13] and Polybench/GPU, and they measure the performance of executing those codes. These benchmarks are first run on the CPU to obtain a baseline against which the GPU results are compared.
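For reference, the following is a minimal sketch of the timing methodology just described; time_benchmark and run_benchmark are our hypothetical placeholders, not symbols from the suite:

    #include <omp.h>

    /* Average wall-clock time of a benchmark body over reps repetitions,
       in microseconds, measured with omp_get_wtime() as the suite does. */
    double time_benchmark(void (*run_benchmark)(void), int reps)
    {
        double total = 0.0;
        for (int r = 0; r < reps; r++) {
            double t0 = omp_get_wtime();              /* seconds, double precision */
            run_benchmark();
            total += (omp_get_wtime() - t0) * 1.0e6;  /* convert to microseconds */
        }
        return total / reps;                          /* default: reps = 10 */
    }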

A brief description of the issues found while building and running this suite with each compiler follows.

OpenUH: The following benchmarks cannot be compiled due to unsupported pragmas present in their code:

  • kernels_if: Problems using #pragma acc kernels if(0).

  • parallel_private: Problems declaring params as private.

  • parallel_firstprivate: Problems declaring params as firstprivate.

  • le_core: Problems with non-scalar pointers.

  • himeno: Problems with non-scalar pointers.

accULL: There is a problem related to a function pointer in the host program: during the source-to-source translation, the compiler alters the syntax of the function pointer declaration, so that a double (*test)(void) is converted into a double *(test(void)). We solved this by manually correcting the change in the intermediate C code generated, re-compiling the object file, and copying it to the main directory to link all the object files again. No warnings about any of the pragmas were issued, so we were able to run all the benchmarks.
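The following snippet illustrates the mangling (the identifier test is the one mentioned above; the surrounding comments are ours):

    /* Original host code: test is a pointer to a function taking no
       arguments and returning double. */
    double (*test)(void);

    /* After accULL's source-to-source pass, the parentheses were rearranged
       so the line declares a function test returning a pointer to double,
       which no longer matches the uses of test in the program: */
    /* double *(test(void)); */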

3.2 Rodinia OpenACC

Rodinia is a benchmark suite for heterogeneous computing [14, 15]. It includes applications and kernels targeting multicore CPUs and GPUs.

Pathscale [6] maintains an effort to port the existing Rodinia benchmarks to OpenACC. We have tested their Rodinia version committed to GitHub on April 25, 2014. Most of the suite works with PGI, but OpenUH and accULL fail to compile most of the tests. We were able to successfully compile the following benchmarks with two or more compilers: gaussian, nw, lud, cfd, hotspot, pathfinder, and srad2.

4 Evaluation

In this section we analyze the OpenACC compilers, using both documentation and experimentation. We use each compiler's documentation to check the completeness of the OpenACC features supported. Then we use the EPCC benchmarks to check both robustness and relative performance. Finally, we check thread-block size sensitivity, measuring the impact of different geometries on performance.

4.1 Experimental Setup

We used an Nvidia GTX Titan Black to run the experiments. This GPU contains 2880 CUDA cores in 15 SMs, with a clock rate of 980 MHz, 6 GB of RAM, and Compute Capability 3.5. The host is a Xeon E5-2690v3 with 12 cores at a clock rate of 1.9 GHz, and 64 GB of RAM in four 12 GB modules.

The PGI compiler is the one contained in the Nvidia OpenACC Toolkit, version 15.7-0, published on July 13, 2015. We used OpenUH version 3.1.0 (published on November 4, 2015), based on Open64 version 5.0 and using GCC 4.2.0, prebuilt and downloaded from the High Performance Computing Tools group website [16]. accULL is version 0.4alpha (published on November 28, 2013), downloaded from the website of Universidad de La Laguna's research group “Computación de Altas Prestaciones” [17].

4.2 Completeness of OpenACC Features Supported

From each compiler's documentation we gain some insight into the completeness of the OpenACC features supported. From this information, we conclude that the OpenACC standard is not yet fully implemented by any of the available compilers. There is work still to be done, but the three compilers are at a respectable level of maturity.

4.3 Robustness and Pragma Implementation

The EPCC benchmark suite contains several benchmarks for testing OpenACC directives. These benchmarks form the “Level 0” group described in the previous section. Table 1 contains the results obtained for the three compilers. In this section we enumerate the problems with each benchmark and explain the results obtained, including the overhead of the different pragma implementations.

Except for Update_host, Kernels_Invoc., and Parallel_Invoc., the time shown is the difference between executing and not executing each pragma. When the overhead is zero (or the pragma is not implemented), the two times are very similar, with minimal stochastic variation; these variations may produce a very small negative result when the difference is computed. When differences are on the order of tens of microseconds (positive or negative), it can be assumed that there is no real difference between the versions tested in that benchmark.

Table 1. EPCC level 0: directive overhead (in \(\upmu \)sec), 1 MB dataset

PGI. There was a problem with the “Kernels_Invocation” benchmark: it returned an incorrect result. The code was not being parallelized and the pragmas were ignored, because it was not explicitly stated that the iterations were independent. This can be solved by adding the restrict keyword to the pointer or the independent clause to the pragma.
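A hypothetical sketch of the two fixes (the code is ours, not the benchmark source): either qualify the pointers with restrict so the compiler can prove the iterations independent, or assert independence directly on the loop:

    /* Fix 1: restrict tells the compiler the pointers do not alias. */
    void scale(double * restrict out, const double * restrict in, int n)
    {
        #pragma acc kernels loop
        for (int i = 0; i < n; i++)
            out[i] = 2.0 * in[i];
    }

    /* Fix 2: the independent clause overrides the dependence analysis. */
    void scale2(double *out, const double *in, int n)
    {
        #pragma acc kernels loop independent
        for (int i = 0; i < n; i++)
            out[i] = 2.0 * in[i];
    }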

The “kernels_if” and “parallel_if” results are very similar, and in both cases they indicate that the code with the pragma is slightly faster than the code without it, even though both run on the host. In [5] it was suggested that this could be due to optimizations performed by the compiler while or after processing the pragmas.

The “parallel_private” benchmark shows that creating private variables for each thread running the loop is slightly faster than allocating device memory.
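A minimal sketch of what the private clause does (hypothetical code, using a scratch variable tmp of our invention):

    void square(float *out, const float *in, int n)
    {
        float tmp;
        /* Each thread executing loop iterations gets its own copy of tmp;
           no device memory has to be allocated for it by hand. */
        #pragma acc parallel loop private(tmp)
        for (int i = 0; i < n; i++) {
            tmp = in[i] * in[i];
            out[i] = tmp;
        }
    }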

“Kernels_combined” shows a very small difference between writing two separate pragmas and a single combined one, the separate version being slightly faster, although the difference is almost negligible. The same holds for the “parallel_combined” benchmark, with an even smaller difference.
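For clarity, these are the two forms the combined benchmarks compare (illustrative code, not the benchmark source):

    /* Separate directives: a kernels region containing a loop directive. */
    void separate(float *a, int n)
    {
        #pragma acc kernels
        {
            #pragma acc loop
            for (int i = 0; i < n; i++)
                a[i] *= 2.0f;
        }
    }

    /* Combined directive: the same semantics in a single pragma. */
    void combined(float *a, int n)
    {
        #pragma acc kernels loop
        for (int i = 0; i < n; i++)
            a[i] *= 2.0f;
    }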

Finally, “Parallel_reduction” and “Kernels_reduction” show that PGI has very little overhead for the reduction clause. In [5] it is stated that the PGI compiler performs the reduction even when it is not annotated, which could explain the very small difference in both benchmarks.
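A minimal sketch of the kind of reduction these benchmarks time (hypothetical code):

    double sum(const double *a, int n)
    {
        double s = 0.0;
        /* Each thread accumulates a partial sum; the runtime combines the
           partial results into s with the + operator. */
        #pragma acc parallel loop reduction(+:s)
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }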

OpenUH. We got errors when compiling the “Kernels_if”, “Parallel_private”, and “Parallel_1stprivate” benchmarks, so they are ignored in this analysis. However, the “Parallel_if” directive is supported, and the difference between using the pragma to run code on the host and running it directly is almost negligible.

“Kernels_combined” shows an overhead of the combined pragma versus the separated version. This is not the case for the “Parallel_combined” benchmark, where the difference is much smaller. The invocation overheads of the kernels and parallel directives are very similar, and for both of them the reduction clause adds a similar overhead. This might be related to OpenUH assuming loops inside kernels regions to be independent.

accULL. No errors appeared while compiling or running the benchmarks with accULL. There is a big difference between the two versions in the “Kernels_if” and “Parallel_if” benchmarks: the kernels directive version has a very small overhead compared to the non-annotated code, whereas the overhead is very large for the parallel directive version. This is explained by the accULL developers in [5], where they state that the absence of a loop clause in the parallel directive causes the loop to be executed sequentially by each thread. Therefore, this clause is not correctly supported, since we understand from the OpenACC specification that the loop should be executed only on the host.
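The following hypothetical sketch illustrates the behaviour described above: without a loop directive, every gang created by parallel executes the whole region redundantly, instead of the iterations being shared out:

    void copy(float *a, const float *b, int n)
    {
        /* No loop directive: each gang runs all n iterations redundantly. */
        #pragma acc parallel
        for (int i = 0; i < n; i++)
            a[i] = b[i];

        /* With loop, the iterations are distributed across the gangs. */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            a[i] = b[i];
    }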

Robustness Summary. The overall results indicate that some of the clauses are not implemented yet, but the three compilers are on their way to a reasonable level of maturity and, since the most commonly used directives work, they can actually be used to parallelize code with OpenACC.

4.4 Relative Performance of Generated Code

In this section we analyze the performance of the generated code, describing the impact of pragma overhead in accULL. Performance measurement is divided into data movement, where we analyze the results of the data movement benchmarks in Level 0 of the EPCC OpenACC benchmark suite, and execution performance, using Level 1 and the Application level of the EPCC suite, plus selected benchmarks from Rodinia.

Effect of Pragma Overhead in accULL. Some results from Level 0 of the EPCC benchmark suite show a performance impact introduced by some clauses and directives in the accULL-generated code.

Kernels and parallel invocations have a higher overhead in accULL than in the other compilers. This is due to the runtime calls and is especially noticeable for the reduction clause. The accumulation of these overheads does not have a significant impact for complex kernels, or when the same kernel is launched repeatedly. However, it can be a problem when running simple kernels or many different small kernels. This is the main reason why the overall results show worse performance for the accULL compiler in this analysis.

Data Movement. Data movement performance is measured by four benchmarks in Level 0 of the EPCC OpenACC benchmark suite. We launched 10 repetitions of those benchmarks with dataset sizes of 1 kB, 1 MB, 10 MB, and 1 GB. The results can be seen in Tables 2, 3, 4, and 5.
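The transfer patterns exercised by these benchmarks are of the following kind (a hypothetical sketch, assuming a buffer buf of n bytes):

    void move_data(char *buf, int n)
    {
        /* copyin: host-to-device transfer at entry to the data region. */
        #pragma acc data copyin(buf[0:n])
        {
            #pragma acc update self(buf[0:n])    /* device-to-host update */
            #pragma acc update device(buf[0:n])  /* host-to-device update */
        }
    }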

Table 2. EPCC data movement results (in \(\upmu \)sec), 1 kB dataset. White cells highlight the best results. norm. is the normalized result using PGI as reference.
Table 3. EPCC data movement results (in \(\upmu \)sec), 1 MB dataset. White cells highlight the best results. norm. is the normalized result using PGI as reference.
Table 4. EPCC data movement results (in \(\upmu \)sec), 10 MB dataset. White cells highlight the best results. norm. is the normalized result using PGI as reference.
Table 5. EPCC data movement results (in \(\upmu \)sec), 1 GB dataset. White cells highlight the best results. norm. is the normalized result using PGI as reference.
Table 6. EPCC execution results (in \(\upmu \)sec), 1 kB dataset. White cells highlight the best results.
Table 7. EPCC execution results (in \(\upmu \)sec), 1 MB dataset. White cells highlight the best results.
Table 8. EPCC execution results (in msec), 10 MB dataset. White cells highlight the best results.

In [5] it was stated that PGI used pinned memory and that this caused issues for smaller datasets. It seems that PGI has since solved this issue and, according to the documentation, it is now possible to select the desired type of memory access with a compilation flag. When using large datasets, OpenUH and accULL do not show the results expected from the trend in Tables 2, 3, and 4. We suspect this is related to the PGI compiler's use of pinned memory, which allows it to obtain better results when the datasets are big enough.

Fig. 1. Execution results (in msec).

Execution Performance, EPCC Benchmarks. In this section we analyze the performance of the code generated by the PGI, OpenUH, and accULL compilers on the benchmarks in the EPCC Level 1 and Application level. We use three different dataset sizes: 1 kB, 1 MB, and 10 MB. This choice is based on the fact that bigger datasets result in an out-of-memory error due to how the benchmarks allocate memory on the device; we suspect the allocation is performed in each thread inside the generated kernels, using more memory than expected. In summary, PGI code obtains better results in almost every benchmark, although the differences shrink for bigger datasets.

For datasets of 1 kB the results can be seen in Table 6. Benchmarks that fail to execute with a specific compiler are marked with an asterisk in the table. PGI code shows very good performance, followed by OpenUH code, which also behaves quite well. accULL shows slightly worse results because it pays a high overhead price for loading a kernel into memory the first time; this operation is not required for subsequent calls to the same kernel, and it is less noticeable for complex kernels where computation dominates. This makes it very hard for accULL to compete with the other compilers on such small and simple problems. The issue is less noticeable with bigger datasets.

When using the default dataset size of 1 MB, we can see in Table 7 that the differences between PGI and the rest are smaller than with the 1 kB dataset. The increase in execution time due to the larger dataset is more noticeable for the PGI compiler and, to a lesser extent, for OpenUH. The accULL results are very similar to those obtained with the 1 kB dataset. Notice the results obtained for the GEMM benchmark with OpenUH, which is probably being executed sequentially.

Table 9. Rodinia execution time results including memory transfers. Total time (in msec). White cells highlight the best results.
Table 10. Rodinia execution results. Kernel time (in msec) not including memory transfers. White cells highlight the best results.

For datasets of 10 MB, the results can be seen in Table 8. Some benchmarks, for example “2MM” and “GEMM”, take unexpectedly long. Running the accULL-generated code for 2MM requires one third of the time required by the PGI and OpenUH codes. The GEMM benchmark shows a huge execution time, probably for the reason described in the previous paragraph.

The 27stencil application is the only benchmark in the Application level that compiles and runs successfully with all the compilers. It is representative of stencil codes, using a three-dimensional neighbour synchronization pattern, and is thus a good representative of a well-known class of applications. Figure 1a shows the execution times obtained for this application on the chosen platform. The results show that one of the compilers cannot produce an efficient implementation. However, as shown in Fig. 1b, all compilers derive similar solutions for simpler three-dimensional stencil codes, such as the 3DCONV benchmark of Level 1.

Execution Performance, Rodinia.

Regarding Rodinia, we should remark that compilation presented some problems and there was a very limited number of compiled benchmarks to choose from. The performance results can be seen in Tables 9 and 10. Only the Gaussian benchmark reports execution times both including and excluding memory transfers. Benchmarks that fail to compile with specific compilers are marked with an asterisk in the tables.

Here we can see that the code produced by PGI does not behave as we would expect from the results obtained with the EPCC benchmark suite. PGI code spends a lot of time in data transfer operations, while OpenUH code does not spend nearly as much time on them. We discovered that the EPCC benchmark suite runs a function containing an OpenACC pragma to make sure that the accelerator device is awake before the real benchmarks are run. This does not happen in Rodinia, where the first OpenACC section is the data movement pragma. PGI-generated code needs more than two seconds to set up the accelerator device, whereas OpenUH and accULL do not. Besides this, accULL code has some trouble running Gaussian: the time is not spent in data transfer operations, but inside the generated kernels. We suspect it might be executing sequential code on the host, due to a failure of the compiler or the runtime to properly use the GPU. However, accULL obtains the best results in Needleman-Wunsch.
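A hypothetical sketch of such a warm-up, similar in spirit to the trick used by the EPCC suite (the function name is ours):

    /* Executing one trivial OpenACC region before the timed code forces
       device initialization up front, so it is not charged to a benchmark. */
    void warm_up_device(void)
    {
        int dummy = 0;
        #pragma acc parallel copy(dummy)
        {
            dummy = 1;  /* trivial work; only waking the device matters */
        }
    }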

Relative Performance Summary. The results indicate better performance for the code generated by PGI on the simple codes in the EPCC benchmarks, but not on the Rodinia applications. OpenUH-generated code is not affected by any noticeable overhead and its performance is very close to that of PGI. To analyze the accULL results, it is important to take into account the overhead produced by kernel loading operations; with bigger input sets the results could be much better, but due to limitations in the benchmark implementations this was not possible at this time.

5 Conclusions

During this work, we have seen that both the OpenACC standard and its compiler implementations are on their way to a reasonable level of maturity. Although many details are still not completely developed, the efforts towards a solid implementation are promising. Nvidia and PGI are devoting many resources to this project, resulting in a very competitive and solid compiler. However, the open-source alternatives are also well positioned: OpenUH and accULL, being academic implementations, are also very interesting and show a huge amount of work by their creators.

Regarding completeness of OpenACC features, according to each compiler's documentation, we find that none of them fully supports the standard. This was expected, as all of them are still maturing.

Speaking of robustness and pragma implementation, PGI shows the best behaviour, as the errors detected were related to implementation issues in the benchmark codes rather than to compiler problems. Compared to the others, the overhead of its implementation is smaller, and it even applies some optimizations when processing pragmas that will be executed on the host.

Finally, the performance comparison we made shows better results for PGI, but the other alternatives have also shown their strengths. It would be interesting to run this performance analysis on several machines and different GPUs in order to observe how the generated code behaves on different hardware; this is part of our future work. The lack of OpenACC benchmark suites makes it very difficult to try different problems or dataset sizes. Our work shows that there is a need for real application codes annotated with OpenACC pragmas to test the actual potential of the current compiler implementations, as many articles [5, 18, 19] have stated before. This is part of our current and future work.