1 Introduction

Parallel computing is widely used to exploit the benefits of aggregated compute resources and the ability to run multiple tasks simultaneously. However, as of today, the vast majority of scientists still prefer to write sequential codes, which can only be executed on a single processor. These programs are parallelized later on, either by rewriting them manually or by using automatic parallelizers. The latter take the initial sequential version of the program as input and automatically generate an equivalent parallel version.

Parallelizing a program can however be very challenging. For a given sequential program, a large number of equivalent parallel programs can be generated. The best parallelization strategy depends on numerous factors, including the target architecture, problem size and problem type. More specifically, a parallel version consists of multiple sub-programs executed on different compute resources simultaneously, potentially with data-dependencies between the tasks. For example, if a processor needs an intermediate result in order to carry on its calculations, and this intermediate result is calculated by another task on a different processor, the first processor has to wait for the second processor to either explicitly provide the data through message passing (distributed memory programming), or provide it indirectly by storing it in a memory space that both processors have access to (shared memory programming). Since the performance cost of shared memory communication is considered low compared to explicit message passing between processors, which has to use the network interconnect, combining shared memory and distributed memory programming models provides a hierarchy that allows larger problems to be solved more efficiently than with a single parallel model. To maximize the performance benefit of parallelization, one has to minimize the overhead of the communication operations that the parallelization process produces. Over the years, different programming languages, libraries, and parallel programming models have been developed for programming parallel computers. For shared memory architectures, where processes or threads communicate by manipulating shared memory variables, POSIX Threads [34] and OpenMP [12] are two of the most widely used Application Programming Interfaces (APIs). For distributed memory architectures, the Message Passing Interface (MPI) [20, 31] is the most widely used API.

In MPI, collective communication operations have been a key concept for minimizing communication costs in large-scale parallel applications. The interface specification of collective operations represents a higher level of abstraction for frequently occurring communication patterns, separating the desired outcome of the group communication from its actual implementation. Collective operations allow for numerous internal optimizations and are therefore essential for the scalability of applications at large process counts, as well as for the readability of the parallel program.

Non-blocking individual and collective operations, the latter (NBCs) introduced in version 3 of the MPI specification, allow communication to be overlapped with computation: the user can move the communication into the background and perform useful computations in the meantime. Consequently, the performance gain is strongly tied to the ability to overlap communication and compute operations.

This paper presents a generic approach founded upon a model-based parallel code generator and auto-tuning techniques to optimize the communication–computation overlap. Three main goals are pursued: (i) applicability, i.e. the scope of HPC applications to which the approach can be applied; (ii) portability, i.e. the breadth of architectures that the approach targets; and (iii) effectiveness, i.e. the amount of communication–computation overlap that the approach produces.

This is achieved by enhancing a state-of-the-art C to C+MPI+OpenMP automatic code parallelizer that is initially limited to a partial and straightforward usage of blocking collective operations. Our approach allows the user not only to bypass the complex steps required to integrate NBCs into parallel codes, but also to benefit from elaborate optimizations by leveraging run-time auto-tuning techniques. Four application benchmarks were used to demonstrate the efficiency and versatility of the approach on two different platforms. The results indicate significant performance improvements in virtually all test scenarios. The resulting parallel applications achieved performance improvements in the range of 32–43% compared to the versions using blocking communication operations, and reached up to 95% of the relative overlap ratio identified for each scenario.

The remainder of the paper is organized as follows: Sect. 2 presents the most relevant related work. Section 3 discusses necessary background notions related to communication–computation overlap and automatic parallelization, as well as the challenges of using non-blocking collective operations. Section 4 introduces the main contribution of this paper, namely the extension of an automatic parallel code generator to maximize communication–computation overlap by integrating an auto-tuner for non-blocking collective operations. Section 5 presents the performance results obtained with our approach, and finally, Sect. 6 summarizes the paper and discusses future work.

2 Related Work

Non-blocking collective operations have been recently standardized by the Message Passing Interface (MPI) Forum. However, efficient designs offered by the MPI communication run-times are likely to be the key factors that drive their adoption.

In order to maximize the overlap of communication and computation, many thread-based solutions for non-blocking collective operations have been proposed [22, 24, 27]. They rely on spare cores in each node to progress non-blocking collectives. In [27], the solution relies on one thread per node to execute the non-blocking collective operations on behalf of the application processes. However, this requires additional memory resources and involves expensive copy operations between the processes and the threads dedicated to performing the non-blocking collective operation. These factors limit the overall performance and scalability benefits associated with using non-blocking collectives in MPI. Moreover, the processes that share the compute resources with the dedicated threads are delayed, which usually leads to an overall load imbalance. The load imbalance is even higher on clusters that run a large number of processes per node, where the task-list of non-blocking collective operations per node becomes larger.

Kandalla et al. [28,29,30] and Inozemtsev et al. [25] focus on hiding the latency of collective operations by offloading them to the networking hardware (mostly InfiniBand) in order to overlap communication and computation, thereby using an independent means of communication progression. However, since these methods require specialized hardware support for asynchronous communication, portability can be restricted. They also currently have performance and scalability limitations, most notably because the memory size and processing speed of the network hardware are generally limited compared to those of CPUs. If the task-list is too large, delegating the progression of non-blocking collective operations to the network hardware could therefore lead to problems [25]. Consequently, hardware-based approaches do not scale with an increasing number of concurrent non-blocking operations.

There are a number of generic auto-tuning frameworks such as the Abstract Data and Communication Library (ADCL [19, 32]), OpenTuner [3], and Active Harmony [35]. As part of the Active Harmony project, Song et al. [33] performed tuning of 3D Fast Fourier Transforms using the non-blocking all-to-all collective communication operation. Their work focuses on tuning the parameters of the 3D Fast Fourier Transform in order to maximize the overlap of communication and computation, and thus does not auto-tune the non-blocking all-to-all collective operation itself. The fundamental problem in the analysis of non-blocking collective operations stems from the fact that it is virtually impossible to capture the ‘visible’ part of the operation. It is therefore impractical to statically auto-tune non-blocking collective operations, and likewise impossible to produce a consistent performance model (LogGP [2], for instance) for the visible part of a non-blocking collective operation or to estimate its performance before an actual execution.

3 Background

In this section, metrics for communication–computation overlap assessment are presented first, followed by a description of the main factors affecting communication–computation overlap. The section then introduces the two software packages that this work is based on: PLUTO, an automatic parallelization framework, and ADCL, an auto-tuning framework for blocking and non-blocking collective operations.

3.1 Metrics for Communication–Computation Overlap

Overlapping communication and computation is generally considered a powerful concept for scaling applications to large numbers of processes. To evaluate the amount of communication–computation overlap achieved throughout the execution of a parallel application, multiple metrics are defined in this section. Given the overall execution time of a code region, \(T_{exec}\), the time spent in communication operations, \(T_{comm}\), and the time spent in computation operations, \(T_{comp}\), we define the following metrics:

  • Potential Overlap Ratio (POR): maximum theoretical overlap ratio in case of a ‘perfect’ communication–computation overlap. POR is defined as:

    $$\begin{aligned} POR = \frac{min\left( T_{comp},T_{comm}\right) }{T_{comp}+T_{comm}} \end{aligned}$$
    (1)
  • Actual Overlap Ratio (AOR): relative amount of time saved during the actual execution. AOR can be calculated as:

    $$\begin{aligned} AOR = \frac{\vert T_{comp}+T_{comm} - T_{exec} \vert }{T_{comp}+T_{comm}} \end{aligned}$$
    (2)
  • Relative Overlap Ratio (ROR): relative overlap achieved during the actual execution, and is calculated as

    $$\begin{aligned} ROR = \frac{AOR}{POR} = \frac{\vert T_{comp}+T_{comm} - T_{exec} \vert }{min\left( T_{comp},T_{comm}\right) } \end{aligned}$$
    (3)

The ROR represents an appropriate evaluation measure, as it determines the proportion of the observed overlap relative to the theoretical overlap. Since the time spent in communication and compute operations does not change between the versions, the execution time \(T_{exec}\) is the value in the formulas above that is affected by the improvement due to overlapping communication and computation. This measure is used later on in the evaluation section of this paper.
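For illustration, the small helper functions below compute the three metrics from measured times; they are not part of the original tool chain and simply assume that \(T_{comp}\), \(T_{comm}\), and \(T_{exec}\) have been obtained, e.g., with MPI_Wtime().

```c
#include <math.h>

/* Helper functions for the overlap metrics defined in Eqs. (1)-(3).
 * t_comp, t_comm and t_exec are assumed to be wall-clock times in seconds. */
static double por(double t_comp, double t_comm)
{
    double t_min = (t_comp < t_comm) ? t_comp : t_comm;
    return t_min / (t_comp + t_comm);                          /* Eq. (1) */
}

static double aor(double t_comp, double t_comm, double t_exec)
{
    return fabs(t_comp + t_comm - t_exec) / (t_comp + t_comm); /* Eq. (2) */
}

static double ror(double t_comp, double t_comm, double t_exec)
{
    double t_min = (t_comp < t_comm) ? t_comp : t_comm;
    return fabs(t_comp + t_comm - t_exec) / t_min;             /* Eq. (3) */
}
```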

3.2 Factors Affecting Communication–Computation Overlap

Multiple factors affect the communication–computation overlap. Although many of the arguments listed in this sub-section also apply to non-blocking individual communication operations, the discussion focuses on non-blocking collective operations, since they are the focus of this paper.

3.2.1 Overlapping Code Sections

Non-blocking collective (NBC) operations consist of three phases: initialization, progression, and completion. Communication–computation overlap occurs between the initialization and the completion of an NBC. If an NBC can be executed alongside another code section, waiting for completion only before the communicated data is required, that code section is called an overlapping code region (or section).
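As a minimal sketch of these three phases, the fragment below overlaps a non-blocking broadcast with an independent code section; independent_compute() is a hypothetical placeholder for an overlapping code region that does not access the broadcast buffer.

```c
#include <mpi.h>

/* Hypothetical overlapping code section; it must not access 'data'. */
void independent_compute(void);

void bcast_with_overlap(double *data, int n)
{
    MPI_Request req;
    MPI_Ibcast(data, n, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req); /* initialization */
    independent_compute();              /* progression happens while this runs */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* completion: 'data' is now usable */
}
```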

In order to hide the costs of NBCs, the user has to find overlapping code sections that comprise sufficient amounts of computation to cover the time spent in the corresponding NBCs. This can be very complex, especially if the computational part contains many data-dependencies. To circumvent this problem, manual code changes can be applied, using techniques such as double-buffering and look-ahead [21, 23]. However, these ad-hoc manual transformations require sophisticated schemes that are highly dependent on the computational problem.

If the computational part of the overlapping sections is larger than the time spent in the corresponding NBCs, the scalability of the parallel application will often be limited. This is because using larger sub-problems often prevents finer-grained parallelism, which would use more compute resources simultaneously. In the ideal scenario, the time spent in computations should be equal to the time spent in the corresponding NBCs. This often depends on the size of the sub-problems, commonly referred to as the block size, and adjusting the block size is an essential factor in achieving satisfactory communication–computation overlap.

Consequently, finding optimal overlapping code sections depends highly on the parallelization strategy applied initially.

3.2.2 Underlying Algorithms

Another challenge in optimizing NBCs stems from the impact that both the hardware and system being utilized, as well as the application characteristics, have on the performance of NBCs. From the system level perspective, factors include networking technology and network topology, processor type and organization, memory hierarchies, operating system versions, device drivers, and communication libraries. In fact, one can argue that every parallel computer is (nearly) unique. Similarly, different parallel applications expose very different communication characteristics, such as the frequency of data transfers, the volume of data transfer operations, and process arrival patterns (i.e. the differences in the times at which processes enter a collective operation due to micro-load imbalances between the processes [14]).

Similarly to their blocking counterparts, NBCs can be implemented in different equivalent ways. For example, a non-blocking broadcast can be implemented using a linear algorithm, a binary tree algorithm, etc. NBCs have to be carefully tuned for a given platform and application to maximize their performance [4].

3.2.3 Progression

Progression is another important aspect of non-blocking operations. The performance of NBCs is closely tied to their progress in the background. There are three ways to ensure background progression:

  • Using a progress thread: each process spawns a separate thread that executes the non-blocking operation in a blocking manner. While this solution is intuitive and simple, it has in practice multiple drawbacks, including how to limit the number of threads used by the library, and how to progress non-blocking individual communication operations.

  • Offloading to the hardware: some networking cards include hardware support for collective operations which allows execution in the background (apart from the processes), and enables communication and computation overlap since the processes are not blocked. However, portability is restricted due to the need of specialized hardware.

  • Using a progress engine: by regularly invoking progress/completion functions (similarly to MPI_Test) at the user-level, the MPI library advances the underlying non-blocking point-to-point operations that constitute the non-blocking-collective operation.

As of today, using a progress engine is the most widely used option because making an MPI library thread safe is highly challenging. The application has to regularly invoke the progress engine of the MPI library [22].

Depending on the application, the locality and frequency of progress-function calls play a significant role in the overall performance. In fact, the overlap of communication and computation may be impaired if, for example, a non-blocking collective operation is delayed while awaiting a progress-function call to trigger the next steps of the operation. On the other hand, since a progress-function call usually involves internal and performance-costly MPI routines, it can create a performance overhead if called too often [4]. Computational overlapping code sections have to be flexible enough for progress-function calls to be inserted at optimal locations within the code sequence.
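The basic pattern is sketched below: progress calls (here plain MPI_Test) are interleaved with an overlapping compute loop; work_on_tile() and the surrounding request handling are hypothetical placeholders.

```c
#include <mpi.h>

void work_on_tile(int tile);   /* hypothetical compute kernel */

/* Overlapping compute loop that periodically drives the MPI progress engine
 * for an outstanding non-blocking collective referenced by 'req'. */
void compute_with_progress(MPI_Request *req, int ntiles)
{
    int flag;
    for (int tile = 0; tile < ntiles; tile++) {
        work_on_tile(tile);                       /* useful computation */
        MPI_Test(req, &flag, MPI_STATUS_IGNORE);  /* progress the pending NBC */
    }
    MPI_Wait(req, MPI_STATUS_IGNORE);             /* complete before data is needed */
}
```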

3.2.4 Overlapping Multiple NBCs

Concurrent NBCs can potentially interfere with one another, especially if they operate on the same network interconnect. Interfering NBCs have to be optimized as a whole, because the outcome of local (separate) optimizations may differ from the outcome a global optimization would provide.

In summary, while NBCs provide a promising way to improve the scalability of parallel applications, their utilization is challenging. The developer has to follow a complex, repetitive cycle to efficiently integrate NBCs into parallel codes:

  • Apply the parallelization strategy.

  • Select the algorithm that minimizes the execution time of the operation.

  • Place progress-function calls at optimal locations within overlapping regions.

  • Enforce data-dependencies and avoid deadlocks (in case of overlapping NBCs).

  • Tune interfering (overlapping) NBCs as a whole (co-tuning).

These steps require a high level of user expertise in a multitude of software libraries, hardware platforms, and the target application itself.

3.3 PLUTO

The main goal of this work is to optimize the communication–computation overlap starting from the earliest stage, i.e. from sequential codes, based on a parallel model. Using a parallel model enables a generic approach and provides consistent control over the four main factors that affect the communication–computation overlap in the earliest stages of parallelization.

Nevertheless, it is beyond the scope of this paper to design and implement a new model-based parallel-code generator that supports these goals from scratch. We investigated different state-of-the-art parallel models and corresponding parallel-code generators, and found that PLUTO [9, 10] provides the best starting point to achieve the goals of this paper.

PLUTO is a parallel-code generator based on the Polyhedral Model [15,16,17], an abstract representation of compute-intensive affine-loop nests. PLUTO performs polyhedral transformations, such as tiling and skewing [5], to generate coarse-grained parallel codes. The skewing transformation shifts the data-dependencies by skewing the grid in appropriate directions. As a consequence, the tiling transformation can be applied without violating data-dependencies, and in a more flexible and efficient way, by carving the grid into independent blocks. Tiling optimizes both the spacial and temporal locality of the data: it divides the grid into coalesced blocks of cells (or elements), bringing otherwise disparate redundant data accesses close to each other within the produced tiles. If a tile is small enough to fit into the compute unit’s cache, the performance overhead due to these redundant data accesses can diminish dramatically. The core transformation framework of PLUTO finds affine transformations for efficient tiling. Hybrid codes including MPI and OpenMP instructions (for scalability on heterogeneous architectures) can be generated automatically from sequential C program sections.

Stencil codes are a class of iterative kernels which update array elements by fixed patterns, called stencils. They are very common in the context of scientific and engineering applications, such as computational fluid dynamics, solving partial differential equations, image processing, and cellular automata. Since computing time and memory consumption grow linearly with the number of array elements, parallel implementations of stencil codes are of paramount importance to research. This is challenging since the computations are tightly coupled (cell updates depend on neighboring cells) and most stencil codes are memory bound, i.e. the ratio of memory accesses to calculations is high.

Most sequential stencil codes correspond to compute-intensive affine-loop nests. Such loop nests are called Static Control Parts (SCoP [6]) and represent the category of sequential programs to which PLUTO applies.

PLUTO is the only model-based parallel-code generator that we are aware of that targets such a large category of scientific problems and generates hybrid codes using MPI and OpenMP. Applications that do not fall into the category of affine-loop nests might still benefit from the approach discussed in this paper. Recent work by Venkat et al. [36] demonstrates how to incorporate applications that cannot express loop upper bounds and control conditions as linear affine expressions (e.g. sparse matrix operations) into the polyhedral model. Thus, our solution could be applied in these scenarios as well. The approach discussed in this paper might, however, not be applicable to applications violating the memory access constraints of affine-loop nests, e.g. by having different arrays with overlapping memory regions. It is, however, not clear in the most generic sense whether an application violating this requirement can be parallelized using e.g. MPI at all, since MPI itself imposes certain restrictions on non-blocking communication operations.

The objective is to provide (i) an optimal communication–computation overlap (effectiveness), (ii) on the largest possible set of HPC applications (applicability), and (iii) on as many architectures as possible (portability).

By choosing PLUTO, both applicability (because of stencil codes) and portability (hybrid code generation based on MPI and OpenMP) are maintained. Effectiveness depends on the intrinsic potential for communication–computation overlap that PLUTO may provide. This potential is explored in the remaining part of this section.

3.3.1 Parallelization Process in PLUTO

In the following, a brief illustration of the loop-nest parallelization process in PLUTO is given, starting from a simple sequential code. Discussion is restricted to the aspects of PLUTO that are most relevant to this paper. For other details, please refer to [8, 13]. Let us consider the sequential code in Fig. 1.

Fig. 1 Illustrative sequential code sample
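Since the figure itself is not reproduced here, the loop nest below is a hypothetical 1-D Jacobi-style example of the kind of code Fig. 1 illustrates; the array names and sizes are placeholders.

```c
/* Hypothetical sequential loop nest in the spirit of Fig. 1: a 1-D Jacobi-style
 * update with a temporal dimension 't' and a spacial dimension 'i'. */
#define T 1000
#define N 100000

double a[N], b[N];

void jacobi_1d(void)
{
    for (int t = 0; t < T; t++) {
        for (int i = 1; i < N - 1; i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
        for (int i = 1; i < N - 1; i++)
            a[i] = b[i];
    }
}
```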

The loop nest is composed of a temporal dimension ‘t’ and a spacial dimension ‘i’. Both dimensions form a 2-D grid. In order to compute the grid in parallel, PLUTO applies skewing and tiling polyhedral transformations that create tiles. The skewing transformation is required in order to produce data-independent tiles. The tiles are evenly distributed across the MPI processes. Figure 2 displays this general scheme: the tile distribution is done over two MPI processes (‘gray’ for process P1, and ‘blue’ for process P2). P2 starts its part of the execution as soon as a second independent tile becomes available, which is the moment depicted in the figure, after the two tiles at the bottom left have been computed by P1.

Fig. 2 Parallelization of 1-D Jacobi stencil code in PLUTO, displaying the different stages of the loop tiling transformation as well as the stencil pattern

In the next step, P1 requires data in order to compute boundary elements. Since the required data is located at P2, P2 has to explicitly communicate it to P1. This is achieved by the MPI part of the code generation in PLUTO. After that, the two processes keep advancing throughout the grid following the same logic, i.e. alternating communication operations and intra-tile compute-operations.

At this point, the two processes each execute two tiles in parallel, step by step. This local parallelism is achieved by the OpenMP part of the code generation in PLUTO. As a consequence of the skewing polyhedral transformation, the processes sweep the grid along a skewed direction in order to be able to execute alternating tiles in parallel. This oblique progression divides the execution into three parts: ramp-up, full-throttle and ramp-down.

In an ideal scenario, during the ramp-up phase, the number of processes involved increases until the maximum number of processes is reached. After that, the number of tiles per process starts to increase until the full-throttle phase is reached. During the full-throttle phase, the number of tiles executed in parallel, per process, reaches its maximum. From a global point of view, each process is assigned a range of tiles in an incremental fashion. Within each process, parallelism is achieved through OpenMP directives where the first loop iterates over the tiles that belong to the process (inter-tile loop). The remaining inner loops iterate within the tile itself (intra-tile loops). After the full-throttle phase is finished, the ramp-down phase represents a ‘mirrored’ equivalent of the ramp-up phase that computes the remainder of the grid.

The communication pattern between processes varies throughout the three phases. An MPI_Alltoall operation is first executed in order to exchange the necessary meta-data. The meta-data consist of the amount of data that needs to be exchanged between each pair of processes. Next, the actual exchange of data occurs through a series of non-blocking MPI_Isend and MPI_Irecv operations, followed by a global MPI_Waitall. This sequence of MPI_Alltoall, MPI_Isend, MPI_Irecv, and MPI_Waitall mimics a neighborhood communication. No computations occur in-between the MPI_Alltoall and the MPI_Waitall operations, which means that the pseudo-neighborhood communication is blocking. In addition, intermediate packing and unpacking routines are used to generate the message buffers.

The code in Fig. 3 depicts the general structure of the output codes generated by PLUTO.

Fig. 3 General structure of a PLUTO generic hybrid code
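As the figure is not reproduced here, the fragment below is a rough sketch of the blocking structure just described; all helper routines and buffer layouts are placeholders, not the literal PLUTO output.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helpers standing in for PLUTO-generated packing and compute code. */
void pack_boundaries(double **send_buf, int *send_counts);
void unpack_boundaries(double **recv_buf, int *recv_counts);
void compute_tile(int t, int tile);
int  first_tile(int t);
int  last_tile(int t);

void time_loop(int T, int nprocs, double **send_buf, double **recv_buf)
{
    int *send_counts = malloc(nprocs * sizeof(int));
    int *recv_counts = malloc(nprocs * sizeof(int));
    MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));

    for (int t = 0; t < T; t++) {
        /* Communication block: exchange meta-data, then the boundary data. */
        pack_boundaries(send_buf, send_counts);
        MPI_Alltoall(send_counts, 1, MPI_INT,
                     recv_counts, 1, MPI_INT, MPI_COMM_WORLD);
        int nreq = 0;
        for (int p = 0; p < nprocs; p++) {
            if (send_counts[p] > 0)
                MPI_Isend(send_buf[p], send_counts[p], MPI_DOUBLE,
                          p, 0, MPI_COMM_WORLD, &reqs[nreq++]);
            if (recv_counts[p] > 0)
                MPI_Irecv(recv_buf[p], recv_counts[p], MPI_DOUBLE,
                          p, 0, MPI_COMM_WORLD, &reqs[nreq++]);
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* no computation in between */
        unpack_boundaries(recv_buf, recv_counts);

        /* Compute block: OpenMP-parallel inter-tile loop, intra-tile loops inside. */
        #pragma omp parallel for
        for (int tile = first_tile(t); tile <= last_tile(t); tile++)
            compute_tile(t, tile);
    }
    free(send_counts);
    free(recv_counts);
    free(reqs);
}
```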

3.4 ADCL

The Abstract Data and Communication Library (ADCL) [19] is an auto-tuning library for collective communication operations. ADCL provides a large number of implementations for each supported operation. Transparently to the user, the library incorporates a runtime selection logic in order to choose the implementation leading to the highest performance. To this end, ADCL internally switches the implementation used during the first iterations of the application in order to determine the fastest version for the current scenario. After each (or some) implementation(s), depending on the selection logic, has been tested a specified number of times, the runtime selection logic chooses the implementation leading to the lowest overall execution time, and uses this version for the rest of the execution.
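The sketch below illustrates the general runtime-selection idea in plain MPI terms; it is not ADCL's actual interface, and the candidate implementations, trial counts, and selection criterion are assumptions for illustration.

```c
#include <mpi.h>

#define NUM_IMPL 3   /* number of candidate implementations of one operation */
#define TRIALS   10  /* measurements per implementation during selection     */

typedef void (*impl_fn)(void);   /* hypothetical implementation signature */

/* Illustration of the runtime-selection idea (not ADCL's actual interface):
 * time each candidate implementation for a few iterations and keep the one
 * with the lowest maximum time across all processes. */
int select_fastest(impl_fn impl[NUM_IMPL])
{
    int winner = 0;
    double best = -1.0;

    for (int i = 0; i < NUM_IMPL; i++) {
        double t0 = MPI_Wtime();
        for (int k = 0; k < TRIALS; k++)
            impl[i]();
        double local = MPI_Wtime() - t0, global;
        /* Agree on one winner everywhere by comparing the slowest rank. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        if (best < 0.0 || global < best) {
            best = global;
            winner = i;
        }
    }
    return winner;   /* index of the implementation used for the rest of the run */
}
```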

Among the communication operations currently supported by ADCL are the most widely used collective operations, such as Cartesian neighborhood communication, All-gather, All-to-all and All-reduce operations. Furthermore, ADCL has recently been extended to support the auto-tuning of non-blocking collective operations [4]. Optimizing non-blocking collective communication operations is however even more challenging than optimizing their blocking counterparts, since the actual time spent in the communication operation cannot be measured directly. This violates a fundamental requirement of auto-tuning libraries, namely to have accurate and reliable measurements for the alternative implementations. To resolve this problem, ADCL introduced the notion of a timer-object, which allows the library to capture the execution time of a larger, user-defined code section in order to optimize non-blocking operations. The results shown in [4] demonstrate the benefits of this approach, with performance improvements of up to 40% compared to libraries that do not have the flexibility to adjust the algorithm used for a communication operation.

Although it is the most generic approach to optimize communication–computation overlap that we are aware of (as of today), ADCL still suffers from some fundamental limitations, for example the fact that, with respect to NBCs, it takes only the underlying algorithms into account. The other major factors detailed in Sect. 3.2, such as overlapping code sections, progression, and overlapping NBCs, are code-dependent. This means that they are fundamentally constrained by the parallel application itself. As a consequence, ADCL (as well as other similar auto-tuning projects) performs only a downstream optimization within a potential that is already limited by the upstream parallelization strategy, from which the structure of the parallel code originates.

4 Concept

The first step towards the goal of this paper is to incorporate double-buffering and look-ahead methods in PLUTO in order to transform the blocking MPI_Alltoall operation into its non-blocking equivalent (MPI_Ialltoall). For this, the Ialltoall operation of the next iteration is launched in advance during the current iteration. This also requires making a copy of the meta-buffers, since the original buffers are involved in the Ialltoall operation of the current iteration and thus cannot be used for another non-blocking communication operation simultaneously. The computational block is then executed while progressing the Ialltoall operation of the next iteration in the background. After the computational block is finished, the Ialltoall operation of the next iteration is completed and the meta-buffers of the current iteration are restored.

Hiding the pseudo-neighborhood communication-operation The main reason that the MPI_Waitall function is called at the end of the communication block (which therefore makes the pseudo-neighborhood operation blocking) is that the next iteration's tiles are part of the compute block, which requires the exchanged data. In order to hide the pseudo-neighborhood communication, a loop fission transform is applied on the compute block to split it into two separate compute-blocks. Only the inter-tile loop is split into two separate analogous loops. The first inter-tile loop operates on the process' inner tiles, excluding every intra-process neighboring tile (see Fig. 2). This way, the first inter-tile loop is guaranteed to operate on data-independent tiles, and can safely overlap the pseudo-neighborhood communication-operation. The MPI_Waitall operation is executed afterwards, along with the data unpacking routines. The second inter-tile loop is then able to operate safely on the tiles that require the exchanged data. Note that the first compute/OpenMP block includes progress calls for both the Ialltoall and Isend/Irecv operations (MPI_Test and MPI_Testall, respectively). The second compute/OpenMP block, however, contains only progress calls for the Ialltoall operation, since the pseudo-neighborhood operation is finished at this point (Fig. 4).

Fig. 4 The new structure for the generic hybrid code, which enables communication–computation overlap
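As the figure is not reproduced here, the sketch below outlines the restructured time step it depicts; the helper routines, the OpenMP pragmas of the inter-tile loops, and the buffer management are placeholders rather than the literal generated code.

```c
#include <mpi.h>

/* Hypothetical placeholders for code that the extended generator would emit. */
extern int n_inner_tiles, n_boundary_tiles;
void start_ialltoall_next_iter(MPI_Request *req);  /* uses copies of the meta-buffers */
int  post_isend_irecv(MPI_Request reqs[]);         /* returns number of requests posted */
void compute_inner_tile(int tile);
void compute_boundary_tile(int tile);
void unpack_exchanged_data(void);
void restore_meta_buffers(void);

void time_step_overlapped(MPI_Request nbr_reqs[])
{
    MPI_Request ia_req;
    int flag;

    start_ialltoall_next_iter(&ia_req);         /* MPI_Ialltoall of the next iteration */
    int nreq = post_isend_irecv(nbr_reqs);      /* pseudo-neighborhood exchange */

    /* First compute block (an OpenMP inter-tile loop in the generated code):
     * inner tiles only, overlapping both outstanding operations. */
    for (int tile = 0; tile < n_inner_tiles; tile++) {
        compute_inner_tile(tile);
        MPI_Test(&ia_req, &flag, MPI_STATUS_IGNORE);             /* progress Ialltoall   */
        MPI_Testall(nreq, nbr_reqs, &flag, MPI_STATUSES_IGNORE); /* progress Isend/Irecv */
    }

    MPI_Waitall(nreq, nbr_reqs, MPI_STATUSES_IGNORE);  /* complete neighborhood exchange */
    unpack_exchanged_data();

    /* Second compute block: boundary tiles, overlapping only the Ialltoall. */
    for (int tile = 0; tile < n_boundary_tiles; tile++) {
        compute_boundary_tile(tile);
        MPI_Test(&ia_req, &flag, MPI_STATUS_IGNORE);
    }

    MPI_Wait(&ia_req, MPI_STATUS_IGNORE);       /* next iteration's meta-data is ready */
    restore_meta_buffers();
}
```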

4.1 Maximizing Communication–Computation Overlap

The previous subsection described how to fundamentally enable communication–computation overlap in PLUTO. Subsequently, the goal is to maximize the communication–computation overlap, considering the four factors detailed in Sect. 3.2.

Overlapping Code Regions The maximal overlapping code region consists of the two compute-intensive blocks. The first block overlaps both the Ialltoall and the pseudo-neighborhood communication, and the second one overlaps with the rest of the Ialltoall operation only. Because the two blocks compute different tiles, the tile size used during the automatic parallelization has a direct impact on the size of the compute blocks. Larger tile sizes increase the time spent in the compute blocks, which enlarges the overlapping code sections. However, maximizing the tile size can potentially harm the communication–computation overlap, since it potentially reduces the number of tiles for a fixed problem size. This is explained in detail later in Sect. 4.1.1.

Underlying Algorithms The Ialltoall communication in PLUTO operates on small amounts of data, called process-pairs data counts. Since the underlying algorithm used for the Ialltoall operation has a considerable impact on performance, the goal in this section is to use the Abstract Data and Communication Library (ADCL [19]) to auto-tune the Ialltoall operation. However, the learning phase of ADCL requires multiple iterations during which the communication and computational patterns remain constant. Because of this, the ramp-up and ramp-down sections have to be excluded from the approach. This is not a major restriction since the full-throttle phase is the dominant phase for long running applications, see Fig. 2. If the total number of time iterations (‘t’ in Fig. 2) increases, the full-throttle section gets longer, while the ramp-up and ramp-down sections remain constant.

Auto-tuning collective communication operations relies on the library being able to accurately measure the execution time of the different implementations and algorithms of the corresponding operation. This is highly challenging for non-blocking collective operations, since the communication is deliberately overlapped with compute operations. To solve this problem, ADCL introduced the timer object, which measures the time spent in a code section and upon which the auto-tuning decision of ADCL is made [7]. The code section measured by the timer object has to cover both the Ialltoall and the corresponding overlapping code regions. This highlights another advantage of the approach presented in this paper: the location where calls to the timer object have to be inserted can be determined automatically by the auto-parallelizer, instead of asking the user to identify the proper location. The code in Fig. 5 displays the new generic code structure. Both ADCL and MPI progress function calls are excluded for simplicity.

Fig. 5 The new structure for the generic hybrid code, with communication–computation overlap and auto-tuning
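The fragment below merely indicates where the measured region begins and ends for one auto-tuning sample; it uses plain MPI_Wtime() for illustration rather than the actual ADCL timer-object interface, and builds on the hypothetical time_step_overlapped() sketched after Fig. 4.

```c
#include <mpi.h>

void time_step_overlapped(MPI_Request nbr_reqs[]);  /* the overlapped step sketched above */

/* One auto-tuning sample: the measured section spans the Ialltoall and both
 * overlapping compute blocks, which is what the ADCL timer object is assumed
 * to cover. */
double timed_step(MPI_Request nbr_reqs[])
{
    double t0 = MPI_Wtime();
    time_step_overlapped(nbr_reqs);
    return MPI_Wtime() - t0;   /* value on which the selection decision is based */
}
```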

Progression Progression is a critical aspect for hiding non-blocking collective operations in the background. An insufficient number of progress calls leads to longer running NBCs, which decreases the communication–computation overlap. Too many progress calls create a performance overhead [4]. Finding the optimal frequency of progress calls is challenging. To solve this challenge, we extended the ADCL library to auto-tune the frequency of progress calls along with the non-blocking operation itself. Since the compute part of the application consists of multiple nested loops, the progress calls can be inserted at each loop level. A progress function located in the innermost loop will lead to a significantly higher number of progress calls than a progress function located in the outermost loop. A depth attribute is used to test the different Ialltoall implementations with different depth values for the progress calls. The depth that provides the best performance within the timer bounds is selected for the rest of the application.
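The following sketch illustrates how such a depth attribute might select the loop level at which progress calls are issued; the loop structure and helper names are hypothetical.

```c
#include <mpi.h>

void update_cell(int tile, int i, int j);   /* hypothetical intra-tile kernel */

/* Illustration of a 'depth' attribute choosing the loop level at which progress
 * calls are issued (0 = inter-tile loop, 2 = innermost loop). */
void compute_block(MPI_Request *req, int depth, int ntiles, int ni, int nj)
{
    int flag;
    for (int tile = 0; tile < ntiles; tile++) {
        for (int i = 0; i < ni; i++) {
            for (int j = 0; j < nj; j++) {
                update_cell(tile, i, j);
                if (depth == 2) MPI_Test(req, &flag, MPI_STATUS_IGNORE);
            }
            if (depth == 1) MPI_Test(req, &flag, MPI_STATUS_IGNORE);
        }
        if (depth == 0) MPI_Test(req, &flag, MPI_STATUS_IGNORE);
    }
}
```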

Overlapping NBCs If the input sequential code operates on multiple grids, the generated code will have multiple communication blocks. ADCL has the ability to attach one common timer object to different requests, which co-tunes multiple requests at the same time. Therefore, if multiple Ialltoall operations are generated alongside multiple pseudo-neighborhood communication operations, ADCL can combine all operations through a common timer object. The same auto-tuning logic described previously applies here as well.

Additionally, ADCL includes a user-configurable meta-parameter to indicate whether two similar requests should be merged and auto-tuned as a single request, or whether they should be distinguished. Two requests are similar if all their parameters (e.g. collective operation, message length, etc.) are the same, except possibly their buffers. This meta-parameter, along with other useful meta-parameters, is defined in a local ADCL configuration file. Switching ‘off’ the request-merging feature produces larger pools of implementations and, as a consequence, achieves a finer optimization. In fact, if all the requests are distinguishable, ADCL considers every combination of implementations as a new and distinct implementation. However, this has an impact on the length of the selection phase. Switching the request-merging feature ‘on’ shrinks the search space.

4.1.1 Tile Size Calculation

The tile size plays a significant role in the granularity of the parallelization, as well as in the potential progression of the entailed non-blocking operations. Though PLUTO is fully automatic, an option to specify the tile size manually is provided. The tile size affects the scalability of the application as follows: the full-throttle phase (during which the number of tiles that can be executed in parallel is maximal) puts a limit on the maximum number of processes that can be used. If the number of processes is larger than the number of tiles, a subset of processes will have only one tile to compute at full-throttle, and the rest of the processes remain idle. The number of tiles per process during the full-throttle phase is referred to as the tile-occupancy, and is proportional to the problem size. A minimal tile-occupancy of 2 is required in order for the MPI processes to compute at least one tile while the other tile is involved in inter-process communication. This inverse correlation between the tile size and the tile-occupancy imposes a critical upper bound on the tile size in order to ensure communication–computation overlap.

On the other hand, the global ratio of communication to compute-operations is correlated with the data exchanged between pairs of neighboring processes, and more specifically, their corresponding adjacent tiles. In general, the amount of communicated data is of one dimension less than the amount of computed elements. For example, the intersection of two adjacent volumetric tiles is a surface, and the intersection of two adjacent surface tiles is a line, and so forth. This suggests that a larger tile decreases the ratio of communication to compute-operations (offering more computations to hide the communication cost).

Consequently, the optimal tile size is the largest tile size for which the tile-occupancy is not less than 2 for each process. Yet, for a single sequential code, PLUTO outputs a single hybrid code that is reusable with an arbitrary number of processes and problem sizes, but with a fixed tile size. While the tile size is a required input parameter for the parallelization, the number of processes and the spacial and temporal dimensions are unknown at this point; determining the optimal tile size is therefore impossible with the original version of PLUTO. Thus, we extended the overall parallelization process in order to calculate the optimal tile size by requiring the user to provide the number of processes as well as the spacial dimensions of the computation grid beforehand. The limitation of this approach is that the user is required to create a separate executable for every process count and problem size. While this represents an inconvenience to the end-users, it is not unusual for many production codes, or even some benchmarks such as the NAS Parallel Benchmarks, to have a similar requirement, most often due to the static declaration of variables. The benefits of having the ability to determine the optimal tile size outweigh the inconvenience.

For example, let us consider a scientific application run by 256 processes to compute a two-dimensional space of 20 million elements. Using a tile size of \(256\,\times \,256\) would lead to a maximum tile-occupancy of 4 tiles per process. A larger tile size of \(512\,\times \,512\) elements produces a tile-occupancy that is below 1. Thus, assuming square tiles with power-of-two edge lengths, the optimal tile size for this scenario is \(256\,\times \,256\).

One of the major benefits of tiling is to improve the data locality (spacial and temporal) for faster data accesses. A tile that fits into the highest CPU cache level allows faster calculations compared to a larger tile that would produce cache misses. For this reason, tile sizes floored to powers of 2 are preferred, to avoid padding for cache-friendliness [26]. However, if the tile size is a single digit per dimension (e.g. \(7\,\times \,7\,\times \,7\)), then flooring to powers of 2 (\(4\,\times \,4\,\times \,4\) in this case) is discarded, since other performance factors become predominant when small tile sizes are used [11].
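Under these assumptions, the selection rule can be sketched as follows; the occupancy model used here (total number of spacial tiles divided by the process count) is a deliberate simplification for illustration, and square tiles with power-of-two edges are assumed.

```c
/* Simplified sketch of the tile-size rule in Sect. 4.1.1: pick the largest
 * power-of-two tile edge for which the tile-occupancy stays at least 2. */
static long occupancy(long elems_per_dim, int dims, long tile_edge, int nprocs)
{
    long tiles = 1;
    for (int d = 0; d < dims; d++)
        tiles *= (elems_per_dim + tile_edge - 1) / tile_edge;  /* ceiling division */
    return tiles / nprocs;
}

long optimal_tile_edge(long elems_per_dim, int dims, int nprocs)
{
    long best = 0;   /* 0 means even the smallest edge violates the constraint */
    for (long edge = 2; occupancy(elems_per_dim, dims, edge, nprocs) >= 2; edge *= 2)
        best = edge;
    return best;     /* largest power-of-two edge with occupancy >= 2 */
}
```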

4.2 Automatic ADCL Configuration

As discussed above, ADCL comprises different configuration meta-parameters that influence its runtime behavior. The main aspect that the configuration parameters are designed to tune is the balance between accuracy and performance overhead during the learning phase. As part of the work described in the paper, we integrated a heuristic to adjust the ADCL parameters into the code generation process. Based on some problem characteristics such as the problem size and the number of iterations required by the application, one can estimate the configuration’s overhead by determining the ratio of the total number of iterations needed by the ADCL selection phase to the total number of iterations of the application.

Specifically, if the relative number of iterations required for the ADCL selection phase exceeds a predefined threshold value (e.g. 15%), the ADCL configuration is gradually modified in order to reduce the overhead estimate, until a satisfactory overhead estimate is met. The modifications to the ADCL configuration include (i) enabling the request-merging feature explained in Sect. 4.1, (ii) decreasing the number of iterations per implementation during the ADCL selection phase, and (iii) activating alternative ADCL search algorithms such as the attribute-based heuristic (cf. Sect. 3.4).
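A schematic version of this heuristic is sketched below; the 15% threshold comes from the text, while the helper names and the order of the relaxation steps are assumptions for illustration.

```c
/* Schematic configuration heuristic: relax the (hypothetical) ADCL settings
 * until the estimated selection-phase overhead drops below the threshold. */
long estimate_selection_iters(void);   /* iterations the selection phase would need */
int  relax_configuration(void);        /* merge requests, fewer trials, heuristic
                                          search; returns 0 when nothing is left */

int tune_adcl_config(long application_iters)
{
    const double threshold = 0.15;
    long sel = estimate_selection_iters();
    while ((double)sel / (double)application_iters > threshold) {
        if (!relax_configuration())
            return 0;                  /* overhead estimate still too high */
        sel = estimate_selection_iters();
    }
    return 1;                          /* overhead estimate is acceptable */
}
```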

5 Experimental Evaluation

In the following section, the impact of the approach described in this paper on optimizing the communication–computation overlap using the ADCL-PLUTO version is evaluated. First, the execution environment is described, followed by detailed results obtained with four different application benchmarks, and two distinct hardware platforms.

5.1 Experimental Platforms

Two clusters located at the University of Houston have been used in the subsequent tests, namely Opuntia and Crill. Opuntia consists of 92 nodes with a total of 1,740 processor cores of 2.8 GHz Intel Xeon E5-2680v2, 1.83 TB of total main memory and a 56 Gb/s Ethernet interconnect.

The second cluster, Crill, consists of 16 nodes with four 2.2 GHz 12-core AMD Opteron (Magny Cours) processors each (48 cores per node, 768 cores total) and 64 GB of main memory per node. Each node further has two \(4\times \) DDR InfiniBand Host Channel Adapters (HCA). The 1.8 series of Open MPI [18] was used in all instances, along with the 0.11 version of PLUTO [1] and the 2.0 version of ADCL [32].

5.2 Application Benchmarks

Four application benchmarks were used to assess the communication–computation overlap achieved by the approach presented in this paper. The four benchmarks are: FDTD-1D, Jacobi-1D, Seidel-2D and ADI-2D. The four benchmarks were developed by Saday et al. [1].

FDTD-1D (Finite-Difference Time-Domain) is a numerical analysis technique used for modeling computational electrodynamics. FDTD-1D performs computations on two different one-dimensional buffers, which represent the spacial dimension. The computations are repeated a certain number of times, which represents the temporal dimension. Jacobi-1D uses relaxation to find discretized solutions to differential equations. It also performs computations over a temporal dimension, but on only a single one-dimensional buffer. A second temporary buffer can be used for optimization purposes, but is not required.

Seidel-2D is an iterative method used to solve systems of linear equations. It performs computations on a single two-dimensional buffer, along with a temporal dimension. ADI-2D (Alternating Direction Implicit) is a finite difference method for solving parabolic, hyperbolic and elliptic partial differential equations. It performs computations on two two-dimensional buffers along with a temporal dimension; a third two-dimensional buffer is used for reading purposes only.

The four application benchmarks were chosen for the sake of representativeness. They can be ordered from the most memory-intensive benchmark to the most compute-intensive one based on the ratio of communication to compute-operations that each benchmark involves:

  • FDTD-1D: two one-dimensional buffers (most memory-intensive)

  • Jacobi-1D: one one-dimensional buffer (more memory-intensive)

  • ADI-2D: two two-dimensional buffers (more compute-intensive)

  • Seidel-2D: one two-dimensional buffer (most compute-intensive)

The higher the number of buffers, the more memory movements are required compared to the amount of compute-operations. In this case, the code is classified as more memory-intensive. On the other hand, the higher the dimension of the buffer(s), the more compute-operations are required compared to the number of memory accesses. In this case, the code is considered to be more compute-intensive (c.f. Sect. 4.1.1).

The four application benchmarks therefore constitute an impartial base-set in order to evaluate the communication–computation overlap achieved by the ADCL-PLUTO version. Each application benchmark is accompanied by a set of performance evaluation parameters. The performance evaluation parameters are: process counts, time iterations (total length of the temporal dimension), and space size [total length, surface, or volume of the spacial dimension(s)].

Based on the experimental platforms described in Sect. 5.1, the process counts chosen for the experimental evaluation were: 64, 128 and 256 for both Opuntia and Crill. In all of the subsequent tests we did not use multiple threads per process, i.e. an MPI job with 256 processes used the same number of cores on the cluster. The tile size used in each test case was based on the algorithm described in Sect. 4.1.1, i.e. we used what we determined to be the optimal tile size for each scenario in the subsequent tests.

5.3 Performance Results Without ADCL Configuration Heuristic

The first set of experiments was conducted without the ADCL configuration heuristic described in Sect. 4.2. Two test cases have been selected for this section, one for each platform: a 256-process scenario on Crill and a 128-process scenario on Opuntia. Tables 1 and 2 show the values of different parameters for these two scenarios. More test cases and results are presented in the subsequent analysis.

Table 1 Benchmark parameter values and subsequent performance indicators on Opuntia. (\(K=thousand\), \(M=million\))
Table 2 Benchmark parameter values and subsequent performance indicators on Crill (\(K=thousand\), \(M=million\))

The first three columns show the number of processes used, the number of time-step iterations (temporal dimension), and the size of the problem space (spacial dimensions). Temporal dimensions and spacial dimensions were chosen to give reasonable execution times, as well as tolerable memory space utilized by the main data structures. Column four displays the relative number of iterations spent in the ramp-up and ramp-down phases (combined) (recall that no communication–computation overlap is achieved during these two phases, c.f. Sect. 4.1). The iterations mentioned here are the iterations of the output-parallel application, and not the input-sequential benchmark. Column five contains the relative performance cost of the ADCL selection phase (in terms of iterations). Column six shows the tile-occupancy and the last column displays the corresponding tile-size.

Each benchmark has been executed three times on randomly generated data, and the average of those three executions is reported (double precision numbers have been used, along with result checking). Since the compute nodes used were allocated exclusively for these tests, there was no significant deviation observed between the individual executions; the typical relative standard deviation observed between the three runs was in the range of 0.5%. After each execution, the maximum time spent across all processes is recorded. The executions were performed within the same batch scheduler allocation and thus had the same node assignments per benchmark for the sake of consistency, in accordance with the parameter values listed in Tables 1 and 2.

The first set of tests was performed using the original PLUTO version, i.e. without communication–computation overlap. Among the values reported by PLUTO are the computational time, the communication time and the total time, as well as other secondary measures such as the total communication volume across all processes. The sum of communication time and computational time represents the theoretical performance achieved with no communication–computation overlap, and is used as a base-line to assess the performance of the ADCL-PLUTO version. The same time-measuring components are subsequently used by the ADCL-PLUTO version to measure the total time spent in the communication–computation section. For the sake of fairness, the other (common) parts of the generated code are excluded.

Fig. 6 Execution times of the four benchmarks on Opuntia with 128 processes (left) and Crill with 256 processes (right)

Fig. 7 Relative Overlap Ratio (ROR) of the four benchmarks on Opuntia for 128 processes (left) and Crill for 256 processes (right)

Figure 6 displays the results on Opuntia (left) and Crill (right). The graphs are structured as follows: each set of four bars indicates the execution time spent in each benchmark (denoted on the x-axis), from left to right showing the time spent in compute-operations only (\(T_{comp}\)), in communication operations only (\(T_{comm}\)), in compute- and communication-operations without overlap (\(T_{comp}+T_{comm}\)), and in compute- and communication-operations using the ADCL-PLUTO version. Figure 7 displays the corresponding ROR values (c.f. Sect. 3.1).

The results indicate that the ADCL-PLUTO version achieved a high communication–computation overlap in most cases, producing overall performance improvements ranging from \(32\%\) (corresponding to ADI-2D on Opuntia) to \(43\%\) (corresponding to Seidel-2D on Crill). The Relative Overlap Ratio (ROR), i.e. the relative amount of overlap achieved during the actual execution, is in the range of \(90\%\) for FDTD-1D, Jacobi-1D, and Seidel-2D. However, the results also expose lower ROR values for ADI-2D on both platforms, namely around \(66\%\) on Opuntia, and \(76\%\) on Crill. Our analysis revealed that the source for the lower ROR value is the higher proportion of the number of iterations spent in the selection phase of ADCL (as shown in Tables 1 and 2) compared to the other benchmarks. To reduce the overhead of the selection phase, we activated the ADCL configuration heuristic described in Sect. 4.2 next.

It is also worthwhile to notice that the most memory-intensive and the most compute-intensive benchmarks, i.e. FDTD-1D and Seidel-2D, respectively (c.f. Sect. 5.2), exhibit the lowest ROR values. However, Jacobi-1D and ADI-2D, which are less memory-intensive and less compute-intensive, respectively, exhibit the highest average. This suggests that the less compute-intensive and the less communication-intensive the code is, the higher the potential of the ADCL-PLUTO pair becomes w.r.t. achieving a maximal communication–computation overlap. In other words, the outcome of the ADCL-PLUTO pair is not only partially, but intrinsically correlated with the inherent balance between the communication and compute-operations of the input code.

5.4 Performance Results Using the ADCL Configuration Heuristic

The next set of measurements documents the performance obtained with the ADCL-PLUTO version using the configuration heuristic of ADCL. The results obtained are displayed in Fig. 8 for both platforms using the 64, 128, and 256 process test cases. The ROR values of the ADI-2D benchmark for the two scenarios discussed in the previous section increased to \(89\%\) on Opuntia and \(91\%\) on Crill. The performance improvement across the four benchmarks also increased to \(37\%\) (corresponding to FDTD-1D on Crill) compared to \(32\%\) obtained without automatic ADCL configuration. This improvement is due to the reduction in the number of iterations needed by ADCL to finish the selection phase of the ADI-2D benchmark. The values in Table 3 represent the proportion of iterations needed by ADCL to finish the selection phase of the ADI-2D benchmark, with automatic ADCL configuration (left column) versus without automatic ADCL configuration (right column).

Note that the results for the other three benchmarks in the two scenarios discussed in the previous section remained unchanged, since the threshold value used in the heuristic was not met and thus the default ADCL configuration was retained.

For all four benchmarks, the dissemination algorithm of the Ialltoall collective operation emerged as the winner of the auto-tuning phase. The Alltoall operation in the ADCL-PLUTO version consists of exchanging meta-counters between the processes that are involved in the subsequent pseudo-neighborhood communication (c.f. Sect. 3.3.1), and the dissemination algorithm is better suited for exchanging short messages than its counterparts.

Fig. 8 Execution times of the four benchmarks on Opuntia (left) and Crill (right) using 64, 128, and 256 processes with automatic ADCL configuration

Table 3 Comparison of the relative performance cost of the ADCL selection phase for ADI-2D with versus without the automatic ADCL configuration phase

5.5 Effect of the Tile Size

In order to demonstrate the effect of the tile size calculated as described in Sect. 4.1.1, measurements with different enforced tile sizes have been performed for the same two configurations used in Sect. 5.3. The tile sizes chosen were the closest smaller and the closest larger values relative to the optimal tile sizes displayed previously in Tables 1 and 2. The results obtained are displayed in Figs. 9 and 10.

Fig. 9 Execution times of the four benchmarks on Opuntia using 128 processes with additional tile sizes

Fig. 10 Execution times of the four benchmarks on Crill using 256 processes with additional tile sizes

The x-axis displays the different tile sizes; the optimal tile size is always the middle value for each benchmark. The results show that the balance between compute-operations and communication-operations in terms of performance cost is altered when the tile size varies, even with a constant problem size. If the tile size decreases below the optimal size, then the tile-occupancy increases, which increases the ratio of communication-operations to compute-operations. This effect is observed on both Opuntia and Crill, and is even more visible with the memory-intensive benchmarks.

On the other hand, increasing the tile size drops the tile-occupancy below 2. As a consequence, a subset of the processes becomes unable to overlap its part of the pseudo-neighborhood communication with the compute-operations. This is due to the fact that tiles (being the basic compute blocks) are fundamentally atomic and cannot be split. A worst-case scenario is when the tile-occupancy drops below 1, in which case part of the processes becomes completely idle (with no tile assigned).

As a consequence of increasing the tile size, the ROR suffers due to the imbalance between the communication and compute-operations. However, the communication costs can decrease since fewer processes are involved. As a repercussion of this effect, a better overall performance is obtained on Opuntia when no communication–computation overlap is achieved (with the default PLUTO). Note that this has been recognized by the ADCL-PLUTO version, which thus preserves the ability to deliver the best performance even in this situation, avoiding the necessity to maintain different parallel versions for each platform.

These results also demonstrate the benefits of auto-tuning the non-blocking collective operations. Figures 9 and 10 denote on the x-axis the implementation of the non-blocking collective operation that led to the minimum execution time. Depending on whether there are one or two independent non-blocking collective operations, the x-axis denotes the implementation for each instance separately, using lin for the linear algorithm, diss for the dissemination algorithm and pair for the pairwise exchange algorithm. Furthermore, the graphs also show the optimal depth for the progress function as determined by ADCL and described in Sect. 4.1. The results prove the usefulness of auto-tuning the collective operations: different implementations were considered optimal by ADCL as a result of the learning phase. A library using a fixed implementation for the non-blocking all-to-all operation would inadvertently have led to lower overall performance.

To summarize the findings of this section, the results demonstrate that the ADCL-PLUTO version is able to achieve nearly optimal execution times even for (enforced) suboptimal tile sizes for the test-cases shown here. The data also supports our reasoning on the impact of the tile size detailed in Sect. 4.1.1.

6 Conclusion and Future Work

In this paper, an approach to automatically optimize the communication–computation overlap is presented. The approach is based on ADCL-PLUTO, an extension of the PLUTO automatic parallelizer incorporating support for non-blocking communication. The paper identifies the main challenges for achieving good overlap of communication and compute operations. It further presents the challenges and conceptual extensions necessary for integrating ADCL, an auto-tuning library for collective communication operations, which allows optimizing both the underlying algorithms used for the non-blocking collective operations and the location and frequency of the accompanying progress function calls. Furthermore, guidelines for determining optimal tile sizes to maximize the overlap between communication and compute operations have been developed, along with a heuristic guiding the search algorithm of the ADCL auto-tuning library depending on the number of iterations executed by the application.

The work has been evaluated with four application benchmarks on two different platforms and multiple process counts. The results indicate significant performance improvements in virtually all test cases. The parallel applications using the ADCL-PLUTO software developed as part of this work achieved a performance improvement in the range of 32–43% compared to the version using blocking communication operations, and achieved a relative overlap ratio (ROR) of up to 95% of the maximum theoretical communication–computation overlap identified for each scenario.

There are multiple avenues for extending the work presented in this paper. First, the neighborhood communication operation could also be converted to use the new MPI-3 collective operations (MPI_Neighbor_alltoall), thus allowing this sequence of communication operations to be auto-tuned similarly to the Alltoall operation. Second, the historic learning methods that already exist in ADCL might be used to perform the auto-tuning during the ramp-up and ramp-down phases as well. Historic learning allows ADCL to utilize knowledge gathered in previous executions, avoiding the learning phase and thus potentially reducing its overhead. Third, the solution could be applied and tested with more applications that fit the general pattern supported by PLUTO. Finally, one interesting aspect of future work would be to evaluate the use of multiple threads per OpenMP block and its impact on the performance and overlap ratios, especially for dynamically scheduled OpenMP loops.

Both ADCL and PLUTO are open source software packages available for download on their corresponding web pages. The modified ADCL-PLUTO version is available for download on the authors' webpages or upon request.