1 Introduction

A dominant trend in modern processors is the growing number of cores integrated on a chip, which sidesteps further increases in clock frequency and power. A large fraction of the HPC systems on the TOP500 list [1] are built from such processors and derive their performance from them. Ideally, performance should scale with the number of cores. The basic building block of such systems is the multicore chip sharing memory within a single node. Programming such parallel architectures composed of NUMA (nonuniform memory access) [2] nodes is highly challenging, because programmers must explicitly manage memory accesses to take full advantage of the memory access characteristics. On the other hand, the large number of cores offers immense potential for thread-level parallelism.

In practice, programmers tend to ignore the hybrid memory system and, for convenience, fall back on a naive OpenMP or MPI (Message Passing Interface) implementation. However, the pure OpenMP strategy tends to perform poorly on such systems because it is designed with SMPs (symmetric multiprocessing) in mind. The pure MPI model fares better, since most MPI libraries exploit the shared memory within a node and optimize inter-node communication.

In this paper, we nonetheless leverage both MPI and OpenMP, a hybrid programming model for NUMA architectures, to investigate multicore scaling. This is not the first time such an approach has been used to parallelize and optimize applications. However, we employ a core grouping strategy to exploit CPU affinity. We also propose an outermost-dimension decomposition scheme that integrates the data-level parallelism scheme—SIMD (Single Instruction Multiple Data)—into the multicore scaling to exploit the best performance of the whole platform.

Our main contributions can be summarized as follows:

  • Proposing a four-level parallelism scheme, including NUMA node decomposition, block decomposition, thread-level parallelism, and data-level parallelism, to exploit the potential benefits of the multicore systems.

  • Implementing the above techniques using a hybrid MPI+OpenMP programming model on two SIMDlized NUMA systems with the SIMD intrinsics supported by each architecture, one with the ARM NEON extension and the other with Intel AVX.

  • Evaluating the proposed multilevel parallelism scheme on two NUMA platforms with two typical 3D stencil computations. Comparisons with naive OpenMP implementations show that our scheme achieves better performance as well as better scalability.

The rest of the paper is organized as follows: Sect. 2 gives a brief introduction to the OpenMP and MPI models, the NUMA architecture, and our motivation for the hybrid programming method. Section 3 illustrates the multilevel parallelism scheme we propose. Experimental evaluations of the proposed scheme are conducted and presented in Sect. 4. Section 5 describes related work. Finally, we conclude the paper in Sect. 6 and provide some hints for future work.

2 Background and motivation

This section describes some preliminaries about OpenMP, MPI, the NUMA architecture, and our hybrid programming approach. We also illustrate the motivation for hybrid programming.

2.1 OpenMP programming model

OpenMP is discussed in detail in [3], and its latest features can be found at https://www.openmp.org/. It is characterized by compiler directives together with a set of runtime library calls. As a portable approach to parallel programming, the high-level abstraction of parallelism provided by OpenMP reduces the difficulty and complexity of parallel programming, so that programmers can devote more energy to the parallel algorithm itself rather than to its implementation details. For multithreaded programming based on data parallelism, OpenMP is generally the right choice. At the same time, OpenMP offers great flexibility and can quickly adapt to different parallel system configurations. Thread granularity and load balancing are challenging problems in traditional multithreaded programming, but with OpenMP the runtime takes over part of this work from programmers. However, as a high-level abstraction, OpenMP is not suitable for situations that require complex synchronization and mutual exclusion between threads. Another disadvantage is that it cannot be used on non-shared-memory systems (such as computer clusters), where MPI is used much more widely.
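As a minimal illustration of this directive style (a sketch with illustrative names, not taken from the paper's benchmarks), a loop can be parallelized with a single pragma and compiled with -fopenmp:

/* Minimal OpenMP sketch: one directive splits the iterations across threads. */
void scale(double *a, const double *b, double c, int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = c * b[i];      /* each thread handles a contiguous chunk of i */
}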

2.2 MPI programming model

MPI is described further in [4] and [5]. As the de facto standard for parallel programming on distributed memory systems, it has two advantages: achievable performance and portability. Specifically, it improves communication efficiency through measures such as avoiding repeated memory-to-memory copies and allowing computation to overlap with communication. The standard API and libraries contribute to its portability. In short, MPI provides a language- and platform-independent standard that is widely used to write message-passing programs. It is practical, portable, efficient, and flexible, and an existing implementation requires little change when ported from one machine to another. Challenges also exist: carefully thought-out strategies are required when programming, and the data partitioning and the associated communication must be considered seriously. Other challenges include memory requirements and fault tolerance [6].
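For reference, a minimal message-passing sketch (illustrative values and names, not from the paper's code) looks as follows:

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI sketch: rank 0 sends one double to rank 1. */
int main(int argc, char **argv)
{
    int rank;
    double x = 3.14;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);
    }
    MPI_Finalize();
    return 0;
}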

2.3 Hybrid programming approach

Given the prevalence of multicore platforms with shared-memory nodes and clusters of such nodes, it is quite natural to combine the two parallel programming models described above into a hybrid programming approach [7]. This model leverages the advantages of MPI across nodes with distributed memory and of OpenMP within a node. In other words, we distinguish between the memory hierarchies: MPI maintains cross-node performance while reducing the number of processes needed, and OpenMP is then in charge of intra-node performance optimization with thread-level and other levels of parallelism.
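A minimal hybrid skeleton, sketched here under the assumption of one MPI process per NUMA node (names are illustrative), combines the two models as follows:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid MPI+OpenMP skeleton: MPI processes across nodes, OpenMP threads inside. */
int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the master thread of each process makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}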

2.4 NUMA architecture

NUMA is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system [8]. Like almost every other processor architectural feature, ignoring NUMA can result in subpar application memory performance. In the NUMA shared memory architecture, each processor has a local memory module that it can access directly with a distinct performance advantage; at the same time, it can also access any memory module belonging to another processor using a shared bus [8]. The costs of these two kinds of memory accesses differ markedly, as Table 1 shows for the ARM platform we work on. A NUMA node always has a distance of 10 to itself, which is the lowest possible value. In contrast to UMA, which provides a centralized memory pool, a NUMA access to remote memory has to traverse the interconnect and go through the remote memory controller. Thus, accessing remote memory adds latency on top of a local memory access. Because of these different memory locations, a NUMA system exhibits nonuniform memory access times.

Table 1 NUMA node distances (ARM)
Table 2 Core grouping according to node distances (ARM)

Two notions are critical for managing performance within the NUMA shared memory architecture: processor affinity and data placement [8]. Based on the NUMA distances listed in Table 1, we can deduce the core grouping of the eight NUMA nodes shown in Table 2, and we conduct the multicore scaling based on this grouping. For instance, when four NUMA nodes are required, we prefer to deploy nodes 0, 1, 3, 5 rather than 0, 1, 2, 3, because the former scheme has better memory access behavior.
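Processor affinity can be pinned explicitly. The sketch below assumes Linux and a consecutive core numbering of eight cores per NUMA node (neither of which is spelled out in the paper); it binds the calling process to the cores of one node:

#define _GNU_SOURCE
#include <sched.h>

/* Bind the calling process/thread to the 8 cores of NUMA node `node`,
   assuming cores 8*node .. 8*node+7 belong to that node. */
static int bind_to_numa_node(int node)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = 8 * node; c < 8 * (node + 1); c++)
        CPU_SET(c, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread */
}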

2.5 Motivation

In this section we demonstrate a simple example to illustrate the necessity of the hybrid programming model. Table 3 shows the performance scalability of a 3D-7pt stencil computation on one of our experimental platforms. The platform has eight NUMA nodes, each consisting of eight cores, and we use only OpenMP for multicore scaling. It can be observed that within a node, when fewer than eight cores are used, approximately linear scalability is obtained. However, when more cores across multiple NUMA nodes are used, no further performance benefit is obtained; a performance regression even occurs due to memory bandwidth contention.

Table 3 Performance scaling using only OpenMP (ARM)

3 Multilevel parallelism scheme

In this section, we demonstrate the proposed four-level parallelism scheme in detail. The overview of our suggested parallel scheme is illustrated in Fig. 1.

Fig. 1 Overview of the four-level parallelisms we utilize

To be specific, the first two levels of parallelism are the block decomposition of the original data grid according to the number of NUMA nodes and the number of processes used. Figure 1a shows the decomposition along the z-dimension without cutting the other two dimensions. Figure 1b depicts the third level—thread-level parallelism—a further decomposition among the threads within a NUMA node along the outer two dimensions. Finally, Fig. 1c shows the last level we study, data-level parallelism, which is usually implemented through SIMD ISAs. We provide more details on each of these techniques in the rest of this section.

3.1 Block decomposition

As depicted in Fig. 2, the first two levels of parallelism are the block decomposition among NUMA nodes and processes. We conduct this decomposition along the outermost z-dimension, because it has a longer reuse distance than the two inner dimensions. For instance, for an \( Nz {\times } Ny {\times } Nx \) input grid, the reuse distance of the 3D-27pt stencil computation along the innermost x-dimension is 1, whereas the reuse distances along the y- and z-dimensions are Nx and \(Nx {\times } Ny\), respectively, which are much larger. We use such a decomposition scheme to avoid ruining data locality.
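A minimal sketch of this decomposition (hypothetical names; it assumes Nz need not be divisible by the number of blocks) computes the contiguous slab of z-planes owned by each block:

/* Outermost-dimension (z) decomposition: block `b` out of `nblocks`
   gets the slab [z0, z1) of an Nz-deep grid. */
static void z_slab(int Nz, int nblocks, int b, int *z0, int *z1)
{
    int base = Nz / nblocks;            /* minimum planes per block          */
    int rem  = Nz % nblocks;            /* the first `rem` blocks get one extra */
    *z0 = b * base + (b < rem ? b : rem);
    *z1 = *z0 + base + (b < rem ? 1 : 0);
}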

Fig. 2 The first two levels of parallelism—block decomposition

The halo areas for exchanges and the asynchronous communications are also colored in Fig. 2. The figure illustrates a 2D stencil input grid, so we cut along the outermost y-dimension correspondingly. Based on the decomposition scheme above, the halo exchanges between blocks required for the next time step are represented by the red double arrows. Three kinds of points exist in the scheme: boundary points, interior points, and halo-area points. The first two kinds require no exchanges, while the halo-area points must be exchanged asynchronously.
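A hedged sketch of such an asynchronous halo exchange between neighboring ranks along the decomposed dimension (hypothetical buffer and tag names; the paper's actual communication code is not shown) could look like this:

#include <mpi.h>

/* Exchange one halo plane of `count` doubles with the previous and next rank,
   using nonblocking MPI calls so the interior update can overlap. */
static void exchange_halos(double *send_lo, double *recv_lo,
                           double *send_hi, double *recv_hi,
                           int count, int rank, int nprocs, MPI_Comm comm)
{
    MPI_Request req[4];
    int nreq = 0;
    if (rank > 0) {                        /* lower neighbor exists */
        MPI_Irecv(recv_lo, count, MPI_DOUBLE, rank - 1, 0, comm, &req[nreq++]);
        MPI_Isend(send_lo, count, MPI_DOUBLE, rank - 1, 1, comm, &req[nreq++]);
    }
    if (rank < nprocs - 1) {               /* upper neighbor exists */
        MPI_Irecv(recv_hi, count, MPI_DOUBLE, rank + 1, 1, comm, &req[nreq++]);
        MPI_Isend(send_hi, count, MPI_DOUBLE, rank + 1, 0, comm, &req[nreq++]);
    }
    /* Interior points can be updated here to overlap communication. */
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
}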

What is worth noting is that our outermost-dimension splitting scheme in Fig. 2b spends less time on halo exchanges than a naive cross split along both dimensions. Generally speaking, for an input grid of size \(Nx {\times } Ny\) and a naive decomposition into blocks of size \(Bx {\times } By\), a total of \( 2[ (Nx/Bx -1) {\times } Ny / By + (Ny/By-1) {\times } Nx / Bx ]\) exchanges is required. Meanwhile, for our scheme, the number of exchanges needed is \( 2 [Nx Ny / (Bx By) -1] \) for the same number of blocks. The difference between the two numbers is:

$$\begin{aligned} \varDelta&= 2\left[ \left( \frac{Nx}{Bx} -1\right) \frac{Ny}{By} + \left( \frac{Ny}{By} -1\right) \frac{Nx}{Bx} \right] - 2 \left( \frac{Nx}{Bx}\,\frac{Ny}{By} - 1 \right) \\&= 2 \left( \frac{Nx}{Bx} -1\right) \left( \frac{Ny}{By} - 1\right) \end{aligned}$$
(1)

In practice, we have \((Nx/Bx -1) > 0 \) and \((Ny/By -1) > 0 \), and as a result \( \varDelta > 0\). Consider the two schemes in Fig. 2, where four blocks are decomposed from the original input grid. In the naive decomposition, eight exchanges are required for the four decomposition boundaries, because each block has to both receive and send. With our decomposition scheme, only three decomposition boundaries exist, and a total of six exchanges suffices for the same number of blocks.

Table 4 Halo exchanges overheads for diverse decomposition schemes

Besides, we evaluate the communication overheads of the two schemes above, and the results are listed in Table 4. The problem size is \(512^{3} \) and the number of time steps is 100. The left column, x-y-z, denotes the block decomposition scheme (eight blocks in total), and the total time column gives the corresponding total time including the halo exchange overhead; we also list the exchange overhead as a percentage. It can be observed that our outermost splitting scheme, 1-1-8, has the minimum overhead of 1.70 s, accounting for 8.60% of the total time, while the innermost decomposition scheme, 8-1-1, has the maximum overhead of 10.05 s, accounting for 21.88% of its total time. The overheads of the other block decomposition schemes lie between these two.

3.2 Thread-level parallelism

The third level of parallelism is thread-level parallelism within a NUMA node. The parallel scheme is outlined in Algorithm 1. The original stencil computation, shown in Line 8, consists of three nested loops. The original input array is referred to as old, and the newly updated one as new. The kernel update formula is in Line 8, where the 26 neighbor elements and the center element itself are weighted by their respective coefficients.

To parallelize the code, we add the parallel directive in Line 1. Line 2 is the workload scheduling approach supported by OpenMP. In Line 3, we set the parallelism granularity by specifying the number of threads used. Line 4 sets the parallel depth; we set it to 2 by combining the outer two dimensions and leave the innermost dimension for the data-level parallelism described in the next section.

Algorithm 1
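Algorithm 1 itself is not reproduced here; the following C sketch conveys the same idea under our own illustrative names (the collapse clause stands in for the parallel-depth setting, and the coefficient array c holds the 27 weights):

/* Thread-level parallelism (a sketch in the spirit of Algorithm 1): the outer
   two loops (z, y) are collapsed across OpenMP threads; the innermost x loop
   is left for the SIMD level described in Sect. 3.3. */
void stencil27_step(const double *old, double *new_, const double *c,
                    int Nz, int Ny, int Nx)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int z = 1; z < Nz - 1; z++)
        for (int y = 1; y < Ny - 1; y++)
            for (int x = 1; x < Nx - 1; x++) {
                double acc = 0.0;
                int k = 0;
                for (int dz = -1; dz <= 1; dz++)          /* 3x3x3 neighborhood */
                    for (int dy = -1; dy <= 1; dy++)
                        for (int dx = -1; dx <= 1; dx++)
                            acc += c[k++] *
                                   old[((long)(z + dz) * Ny + (y + dy)) * Nx + (x + dx)];
                new_[((long)z * Ny + y) * Nx + x] = acc;
            }
}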

3.3 Data-level parallelism

The last level of parallelism we investigate is data-level parallelism, i.e., the SIMD extension supported on each platform. We omit the specific principles and rules, since much has been discussed in [9,10,11,12,13] and [14]. We use a 1D-3pt stencil as an example to demonstrate the SIMD intrinsic implementations on the two platforms. The two implementations for the two platforms are briefly listed in Fig. 3:

Fig. 3 Two intrinsic implementations using SIMD extension

For the ARM platform, the NEON extension has a 128-bit vector length, and the corresponding vector width V is 2 for double precision. The Intel platform supports a 256-bit vector length, so its V equals 4. It can be observed that the two implementations follow a similar pattern. Note that the data types of old and new are the corresponding SIMD data types on each platform rather than double precision.
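Figure 3 is not reproduced here; the following hedged sketches (assuming AArch64 NEON and plain AVX, unit-stride double-precision data, and illustrative names) show the pattern of the two intrinsic versions for a 1D-3pt stencil new[i] = a*old[i-1] + b*old[i] + c*old[i+1]:

#if defined(__ARM_NEON)
#include <arm_neon.h>
/* ARM NEON: V = 2 doubles per 128-bit vector. */
void stencil3_simd(const double *old, double *out, int n,
                   double a, double b, double c)
{
    float64x2_t va = vdupq_n_f64(a), vb = vdupq_n_f64(b), vc = vdupq_n_f64(c);
    int i;
    for (i = 1; i + 2 <= n - 1; i += 2) {
        float64x2_t s = vmulq_f64(va, vld1q_f64(&old[i - 1]));
        s = vfmaq_f64(s, vb, vld1q_f64(&old[i]));        /* s += b * old[i..]   */
        s = vfmaq_f64(s, vc, vld1q_f64(&old[i + 1]));    /* s += c * old[i+1..] */
        vst1q_f64(&out[i], s);
    }
    for (; i <= n - 2; i++)                              /* scalar remainder    */
        out[i] = a * old[i - 1] + b * old[i] + c * old[i + 1];
}
#elif defined(__AVX__)
#include <immintrin.h>
/* Intel AVX: V = 4 doubles per 256-bit vector (mul+add; FMA would need AVX2/FMA). */
void stencil3_simd(const double *old, double *out, int n,
                   double a, double b, double c)
{
    __m256d va = _mm256_set1_pd(a), vb = _mm256_set1_pd(b), vc = _mm256_set1_pd(c);
    int i;
    for (i = 1; i + 4 <= n - 1; i += 4) {
        __m256d s = _mm256_mul_pd(va, _mm256_loadu_pd(&old[i - 1]));
        s = _mm256_add_pd(s, _mm256_mul_pd(vb, _mm256_loadu_pd(&old[i])));
        s = _mm256_add_pd(s, _mm256_mul_pd(vc, _mm256_loadu_pd(&old[i + 1])));
        _mm256_storeu_pd(&out[i], s);
    }
    for (; i <= n - 2; i++)                              /* scalar remainder    */
        out[i] = a * old[i - 1] + b * old[i] + c * old[i + 1];
}
#endif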

3.4 Putting it all together

Taking all the aforementioned techniques into consideration, the pseudocode for the kernel combining both MPI and OpenMP is shown in Algorithm 2:

Algorithm 2

The first two levels of parallelism are conducted before the time step iterations in Lines 1-2. Note that for the MPI instances of the multiple processes in Line 5, we add parallel to indicate that they run concurrently; this part also contains the halo exchanges and asynchronous communications mentioned earlier, which we omit in the pseudocode. Line 5 is the OpenMP parallel region mentioned above. In Line 9, we express the SIMD kernel updates in brief.
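Algorithm 2 is not reproduced here; the following C sketch shows how the pieces above could be combined (hypothetical names throughout; for brevity each rank is assumed to hold the whole arrays and update only its own slab, whereas the real code keeps only the slab plus halos and calls the asynchronous exchange of Sect. 3.1, and the 7-pt update stands in for the SIMDlized 27-pt kernel):

#include <mpi.h>

void z_slab(int Nz, int nblocks, int b, int *z0, int *z1);   /* Sect. 3.1 sketch */

void run_steps(double *old, double *new_, double c0, double c1,
               int Nz, int Ny, int Nx, int nsteps, MPI_Comm comm)
{
    int rank, nprocs, z0, z1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    z_slab(Nz, nprocs, rank, &z0, &z1);                 /* block decomposition   */
    int zlo = z0 > 1 ? z0 : 1;                          /* skip global boundary  */
    int zhi = z1 < Nz - 1 ? z1 : Nz - 1;

    for (int t = 0; t < nsteps; t++) {
        /* halo exchange of boundary z-planes with neighbor ranks goes here */
        #pragma omp parallel for collapse(2) schedule(static)   /* thread level */
        for (int z = zlo; z < zhi; z++)
            for (int y = 1; y < Ny - 1; y++)
                for (int x = 1; x < Nx - 1; x++) {      /* SIMD kernel in practice */
                    long idx = ((long)z * Ny + y) * Nx + x;
                    new_[idx] = c0 * old[idx]
                              + c1 * (old[idx - 1] + old[idx + 1]
                                    + old[idx - Nx] + old[idx + Nx]
                                    + old[idx - (long)Ny * Nx] + old[idx + (long)Ny * Nx]);
                }
        double *tmp = old; old = new_; new_ = tmp;      /* swap time levels */
    }
}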

4 Performance evaluation

4.1 Experimental test bed

Details about the core, socket, system, and programming environment of the experimental platforms are given in Table 5. The experimental ARM processor integrates 64 ARMv8-compatible cores and adopts a parallel system-on-chip (PSoC) architecture. Its NEON extension supports a 128-bit vector length with double precision. The 64 cores are divided into eight NUMA nodes, each with eight cores. The Intel platform belongs to the Broadwell family. Its SIMD extension has a 256-bit vector length with double-precision support, and its 20 cores are grouped evenly into two NUMA nodes. Both platforms have multilevel memory hierarchies and support hardware prefetching. Other architectural parameters are listed in Table 5. All benchmarks are compiled with -O3 and no other special options.

Table 5 Architectural summary of experimental platforms

4.2 Stencil overview

One of the benchmarks we work on is shown in Fig. 4. In this paper, we take two typical 3D stencils as examples to evaluate the platforms’ performance. Some general characteristics of the stencils are listed in Table 6.

Fig. 4 The 3D-27pt stencil computation we consider

The 27-pt stencil first loads twenty-seven points, performs the update, and then writes the updated value into another 3D array, which means that two separate 3D arrays are required. To be specific, the 27 points are partitioned into one central point, six face points, twelve edge points, and eight corner points according to their locations and distances from the center point. The 125-pt stencil has a similar memory access behavior. Besides, we assume that each point has a distinct coefficient in both stencils.

Table 6 Average stencil characteristics (ARM and Intel)

4.3 Architectural specific exceptions

Due to hardware infrastructure and software compiler configurations, not all benchmarks are evaluated under the same parameters. The ARM platform has only 16 GB of main memory, so the input grid size of \( 1024^{3} \) fails there. Besides, only GCC (GNU Compiler Collection) is available on that platform, so the NEON and other related options of the vendor ARM compiler are not used. The Intel platform has only two NUMA nodes, so its scalability experiments differ from those on the ARM platform with eight NUMA nodes; however, with 64 GB of main memory it accommodates a broader range of problem sizes. Besides, we conduct single-precision experiments on the ARM platform and double-precision ones on the Intel platform because of the NEON and AVX vector length difference. The OpenMP scalability within a node also differs, since one platform has eight cores per NUMA node and the other ten.

4.4 Results and analysis

We conduct the experiments according to the aforementioned configurations. Results for the ARM and Intel platforms are shown in Figs. 5 and 6, respectively. We present three versions of the two stencil computations under various input grid sizes: the pure OpenMP version (OMP), the hybrid MPI + OpenMP version (MPI+OMP), and the SIMDlized version (MPI+OMP+NEON/AVX). The speedup illustrates the effectiveness of the SIMD extension on each platform and is calculated between the MPI+OMP version and the MPI+OMP+NEON/AVX version.

4.4.1 ARM platform

Experimental results on the ARM platform show the scalability across NUMA nodes. The OMP version, using only OpenMP, scales only with the cores within a single NUMA node, whereas the MPI+OMP version scales both within and across NUMA nodes. The best performance is 123.41 GFlops for the 3D-27pt stencil and 117.89 GFlops for the 125-pt one. The NEON speedup for the two benchmarks is also prominent, with an average improvement of 1.40\( {\times } \) and a maximum of 2.00\( {\times } \) for the 27-pt stencil, and 2.04\( {\times } \) and 2.38\( {\times } \) for the 125-pt one, respectively.

To be specific, the two stencils of various sizes lead to the same conclusion. As the number of processes grows from 1, 2, 4, 8, and 16 to 32, the MPI+OMP version outperforms the OMP version, while the OMP version shows no scalability. This is especially obvious when the cores used span multiple NUMA nodes. On the ARM platform, one NUMA node consists of eight processor cores. Taking the 27-pt stencil of size 128 in Fig. 5a as an example, the performance of the OMP version barely changes from 16 to 32 cores, whereas the MPI+OMP version achieves a speedup of about 2.0 with all other configurations the same.

Fig. 5 Experimental results for ARM

4.4.2 Intel platform

By combining OpenMP and MPI we also obtain good scalability across the two NUMA nodes and within a NUMA node on the Intel platform. The best performance is 113.06 GFlops for the 3D-27pt stencil and 159.26 GFlops for the 3D-125pt one, which amount to 29.44% and 41.47%, respectively, of the platform’s peak performance. The AVX speedup for all benchmarks and input grid sizes is substantial, with an average speedup of 1.97\({\times }\) and a maximum of 2.17\({\times }\) for the 27-pt stencil, and 3.51\({\times }\) and 4.17\({\times }\) for the 125-pt one, respectively.

Similar results occur on the Intel platform, where one NUMA node consists of ten processor cores. Taking the 27-pt stencil of size 128 in Fig. 6a as an example, the performance of the OMP version barely changes from 10 to 15 and 20 cores, whereas the MPI+OMP version achieves a speedup of about 2.0 with all other configurations the same. These results show the effectiveness of our proposed multilevel parallelism optimization scheme.

Fig. 6 Experimental results for Intel

5 Related work

To our knowledge, several pieces of related work exist. [15] and [16] proposed a DSL-based framework that automatically generates parallel high-performance stencil codes with both productivity and performance benefits within the ORWL paradigm, an inter-task synchronization model for iterative data-oriented parallel and distributed algorithms that uses strict FIFO ordering for access to all resources; their experiments were based on Grid’5000 [17], a large-scale and flexible test bed for experiment-driven research. [18] studied a multilevel autotuning framework that includes data allocation, domain decomposition, low-level, and autotuning techniques. Some of their strategies, especially the decomposition schemes, are similar to ours; however, their experiments were conducted on SMP platforms using Pthreads. [19] discussed the motivation for combining OpenMP and MPI, where coarse-grained OpenMP parallelism is used to overlap MPI communication and computation for stencil-based grid programs on an IBM SP. [20] and [21] studied data movement strategies for heterogeneous architectures with distributed memory using the polyhedral model on an AMD CPU and an NVIDIA GPU, which greatly reduced the communication volume between computing devices. [22] investigated hybrid MPI/OpenMP parallel programming on clusters of multicore SMP nodes, analyzing the cases of pure MPI, pure OpenMP, and the hybrid MPI + OpenMP model separately. [6] demonstrated the benefits of the hybrid approach for performance and resource usage on three multicore-based parallel systems, including one IBM and two SGI Altix systems, with two real-world applications. To the best of our knowledge, ours is the first work to fuse the hybrid programming model and multilevel parallelism on NUMA, SIMDlized architectures. We make full use of the whole platform’s potential benefits, including core-, socket-, and system-level architectural characteristics. Putting all these considerations together, we evaluate two commonly used stencil computations and exploit their best performance.

6 Conclusions and future work

We propose, implement, and evaluate a multilevel parallelism scheme for optimizing stencil computations on NUMA multicore platforms with SIMD support. It is motivated by the fact that the systems we work on comprise diverse memory regions, including local and remote memory. A hybrid MPI and OpenMP programming approach is employed. We utilize a four-level parallelism optimization strategy: block decomposition across NUMA nodes and processes with MPI, thread-level parallelism with OpenMP, and data-level parallelism based on SIMD extensions. Experimental results on two ubiquitous 3D stencil computations demonstrate the effectiveness of the proposed scheme. Future work may include extensions to more hardware platforms as well as more stencil computations. Multinode systems or clusters should also be considered for performance scalability. Other optimizations, including cache blocking [23,24,25], and [26], register blocking, as well as instruction-level parallelism [18], could also be integrated into our scheme to further improve performance.