
1 Introduction

Performance portability and programmability of scientific computing applications are gaining importance as hardware becomes more heterogeneous. With GPUs now commonplace in scientific computing, the landscape of multi-core CPUs is also diversifying, with x86_64 (amd64) joined by POWER and ARM. Porting scientific codes to new hardware costs valuable developer time due to the large semantic gap between hardware-specific programming models.

Typically in High-Performance Computing (HPC), semantic gaps are addressed by abstraction [7]. This has also been successfully demonstrated for performance portability, for example by libraries like Kokkos [8], Alpaka [9], and RAJA [1], as well as Intel’s OneAPI built on top of SYCL [6], which provide abstractions to execute code across hardware platforms. While offering a good variety of data structures as containers, these libraries have limited memory layout restructuring capabilities, in particular if an object is not of a primitive type. Libraries like LLAMA [2] provide complex memory layout restructuring across hardware platforms, but are limited to multi-dimensional arrays as containers. Moreover, most existing libraries lack the capability of combining data structures with tuple-based layout switching, and all of them lack support for sparse data structures or the possibility to automatically serialize/deserialize arbitrarily nested data structures.

Here, we address this gap by providing an open-source memory- and compute-abstraction library that supports arbitrarily nested and sparse tuple data structures mapped to different memory layouts, as well as commonly used basic algorithms tuned for performance on a variety of hardware targets. Our library is implemented using C++ tuples (see Sect. 2) for compile-time code generation of generic scalar, vector, and tensor multi-dimensional arrays, in addition to more complex data structures like compressed-sparse-row graphs, cell lists, and arbitrary-dimensional sparse block grids [3]. Our library uses memory mirroring to support data structures that exist simultaneously on device and host, enabling, e.g., CPU and GPU sections of a user code to share an abstract data structure mapped to both memories. We provide optimized algorithms along with the data structures, e.g., for arbitrary-dimensional convolutions, sorting, prefix sum, reduction, and scan (Sect. 3).

The presented library, openfpm_data, is available as part of the OpenFPM scalable computing project [4]. It provides the shared-memory layer of OpenFPM, but can also be used as a stand-alone library. It provides two interfaces for user-implemented algorithms over abstract data structures: CUDA-like compute kernels and lambda functions. Since openfpm_data is able to shape pointers to external memory, zero-copy interfaces are possible with other libraries that provide algorithms or shape memory, like Kokkos [8] or LLAMA [2], supplementing them, e.g., with sparse grids, graphs, or neighborhood search.

We show in micro-benchmarks and in a real-world application that the flexibility afforded by openfpm_data does not impact performance (Sect. 4). Indeed, we find that combining memory layout restructuring of complex data structures with generic algorithms can benefit the performance optimizations of modern C++ compilers on multiple CPU and GPU architectures. We conclude the paper in Sect. 5.

2 From C++ Tuples to Compile-Time Data Structures

We construct memory-layout reconfigurable data structures with a common abstract programming interface by exploiting two features of the C++ programming language. The first is the existence of three types of brackets: <>, (), and []. We use them to cleanly separate the semantics of data structures. Angle brackets specify which property of a tuple/composite data structure one wants to access. Round parentheses specify an element of a discrete set. Square brackets access individual components of a vector or array. This three-bracket access semantics is common across all data structures and independent of the physical memory layout used.

The second C++ feature we use is tuples (and, consequently, variadic templates). We use the tuple data structure provided by the Boost library to define the properties or elements of a data structure. Using tuples instead of structs enables content parsing at compile time using template meta-programming. The memory layout (or memory mapping) of a data structure is determined at compile time by a layout restructuring algorithm implemented using meta-programming. We then construct an object that stores the information of a container with the specified layout and inject the access methods with the layout-specific code required to overload the three bracket operators.
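To illustrate the idea behind this construction (a minimal sketch using std::tuple, not the actual openfpm_data implementation), a Struct-of-Arrays container exposing the three-bracket access semantics could look as follows:

```cpp
#include <array>
#include <cstddef>
#include <iostream>
#include <tuple>
#include <vector>

// Minimal illustration (not the openfpm_data implementation): a Struct-of-Arrays
// container over a compile-time tuple of properties. Each property is stored in
// its own std::vector; get<P>(i) returns a reference to element i of property P,
// mirroring the <>, (), [] access semantics described above.
template <typename... Props>
class soa_container
{
    std::tuple<std::vector<Props>...> columns; // one contiguous array per property

public:
    explicit soa_container(std::size_t n) : columns(std::vector<Props>(n)...) {}

    template <std::size_t P>
    auto& get(std::size_t i)                   // <P> selects the property, (i) the element
    {
        return std::get<P>(columns)[i];
    }
};

int main()
{
    // Property 0: a scalar field, property 1: a 2-vector field.
    soa_container<float, std::array<float, 2>> grid(100);

    grid.get<0>(42) = 1.0f;      // <property>(element)
    grid.get<1>(42)[1] = 2.0f;   // <property>(element)[component]

    std::cout << grid.get<1>(42)[1] << "\n";
}
```

In openfpm_data, the analogous access methods are generated by template meta-programming for each supported layout, so user code remains unchanged when the layout is switched.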

Fig. 1. Summary of the openfpm_data library: The UML diagram on the left lists the implemented containers and their composition, starting from multi-dimensional arrays, with template parameters as listed in the right box. The first template parameter (green) is the tuple defining the data type of the container. The memory layout is defined by the second parameter (red). The linearization of multi-dimensional indices is defined by the third template parameter (violet). The fourth template argument (yellow) defines the type of memory to be allocated: GPU device (Nvidia or AMD) or heap memory. The three dots outside the box indicate that the interface can be extended to user-defined layouts, linearizations, and memory types. (Color figure online)

The data structures and memory layouts available in openfpm_data are summarized in Fig. 1. The UML diagram on the left shows the composition of the available containers, starting from the base class “multi-dimensional array”. A vector is a one-dimensional array, a Compressed Sparse Row (CSR) graph is stored in an encapsulated vector of vertices and edges, a map is a sorted vector, and a sparse grid is an n-dimensional map [3]. All sub-classes inherit the layout reconfigurability of the base class as defined by the four template parameters shown in the right box. Every container in the hierarchy can override every layout parameter, leading to a vast diversity of possible implementations.
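As a hedged illustration of how such containers appear in user code (type names follow the OpenFPM examples; exact headers and template signatures may differ), the same property tuple can back different containers:

```cpp
// Hedged sketch, following the naming used in the OpenFPM examples (exact
// template signatures and headers may differ): the tuple type "aggregate"
// lists the properties of each element, here a scalar, a 2-vector, and a
// 2x2 tensor; the same tuple can back vectors, grids, or sparse grids.
using props = aggregate<float, float[2], float[2][2]>;

openfpm::vector<props> v;            // one-dimensional container over the tuple
v.resize(1024);

v.template get<0>(7) = 1.0f;         // <property>(element)
v.template get<2>(7)[0][1] = 2.0f;   // <property>(element)[row][column]
```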

Fig. 2. Example illustrating the classes involved in accessing an element of a Struct-of-Arrays (SoA) container in GPU memory with standard C++ striding linearization for the () operator. The figure shows how the method grid.get<stress>(element)[x][y] is implemented across classes using the three bracket types of C++. The colors of the arrows and parameters match the colors of the brackets and their arguments. In this example, the component [x][y] (two-dimensional tensor index) of the element (element) of the named property <stress> is accessed. This is how one would access the components of a stress tensor field in a fluid mechanics simulation. The operator () is overloaded by grid_sm (green arrow), which converts the multi-index to an integer (orange) using standard C++ striding. This integer is passed to multi_array_ref_openfpm, which overloads the [] operator. The class memory_traits_inte implements the interleaved memory layout for SoA with memory allocated on the GPU in the CudaMemory object. (Color figure online)

Figure 2 illustrates the mechanism used for memory mapping and for abstract layout switching. In the example of the figure, the object memory_traits_inte implements the meta-algorithm that transforms a tuple into a multi-dimensional container object with interleaved (i.e., SoA) memory layout, and it contains the code for the bracket access functions. In the figure, this is shown for the get method on a multi-dimensional array named grid, accessing tensor component [x][y] of an element of the property stress (e.g., the stress tensor field of a fluid mechanics simulation). All layout-specific code is encapsulated in the objects that overload the bracket operators, as indicated by the colors.

All openfpm_data data structures support memory mirroring to use host and device memory simultaneously. Mirrored data structures simplify code where some sections run, e.g., on a CPU and others on a GPU. However, openfpm_data does not provide any memory consistency model. Synchronization of a mirrored data structure needs to be triggered by the user when needed. Functions to transparently move data from device to host and vice versa are provided.
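A hedged sketch of this usage pattern (method names follow the OpenFPM examples; exact signatures may differ):

```cpp
// Hedged sketch of memory mirroring (names as in the OpenFPM examples; exact
// signatures may differ). Synchronization is explicit and triggered by the user.
openfpm::vector_gpu<aggregate<float, float[2]>> v;
v.resize(N);

v.template get<0>(0) = 1.0f;        // host-side write to the mirrored structure

v.template hostToDevice<0>();       // copy property 0 from host to device
// ... launch a GPU kernel operating on v.toKernel() ...
v.template deviceToHost<0>();       // copy the results back before host access
```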

3 Generic Algorithms over Abstract Data Structures

We complement the hardware-independent data structures and memory layout capabilities of openfpm_data with generic algorithms, which are translated to optimized hardware-specific implementations at compile time. We further expose two different interfaces for user-implemented algorithms: a CUDA-like kernel interface and a lambda function interface.

In order for openfpm_data kernels to run on multiple hardware backends, we provide hardware-native implementations of the following algorithmic primitives: prefix sum, atomic add, stencils, n-dimensional convolution, adding and removing elements from maps, data structure copying and merging, sorting, segmented reduce, in-warp reduce, and cell lists. These primitives are encapsulated in switchable backend objects that select the hardware-specific implementation. At the time of writing, the following four backends are available: CUDA (Nvidia GPU), HIP (AMD GPU), SEQUENTIAL (CPU), and OpenMP (CPU). The backend is chosen by the user at compile time.

For the CUDA and HIP backends, the openfpm_data algorithms directly wrap the corresponding implementations in CUDA/HIP via the CUB/hipCUB API. The SEQUENTIAL backend executes each thread block sequentially on the CPU. In this backend, __global__ and __device__ are mapped at the preprocessor level to an empty string and to inline, respectively, and blockIdx, blockDim, threadIdx, and gridDim are global variables.
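Schematically, this mapping amounts to something like the following simplified sketch (the actual openfpm_data macros may differ):

```cpp
// Simplified sketch of the SEQUENTIAL mapping described above (the actual
// openfpm_data macros may differ): the CUDA qualifiers disappear or become
// inline, and the CUDA built-ins become ordinary global variables.
#define __global__
#define __device__ inline

struct uint3_t { unsigned int x, y, z; };
extern uint3_t blockIdx, blockDim, threadIdx, gridDim;  // set by the backend per block/thread
```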

User-implemented algorithms can be written as CUDA-like compute kernels or using a lambda interface. Like in CUDA, openfpm_data kernels are labeled with the attribute __global__, and device functions are labeled with the attribute __device__. Also like in CUDA, computation is divided into a grid of blocks, where each block contains a user-defined number of threads. Within a kernel, openfpm_data provides the local variables blockIdx, blockDim, threadIdx, and gridDim, which contain the thread-block index, the block dimensions, the thread index within the block, and the number of blocks in the grid. Static shared memory is available via __shared__, and __syncthreads() is implemented with lightweight threads (number of threads = size of the thread block; each thread has an 8 KB stack, extensible via a compile-time parameter) and fast context switching. Every time __syncthreads() is encountered, execution is suspended and a context switch moves execution to the next lightweight thread; when the end of a block is reached, execution resumes cyclically with the first lightweight thread of the block. While this leads to sub-optimal performance, it allows user-defined kernels for which no backend-native implementation is available to at least run (e.g., for debugging).
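For illustration, a minimal user kernel in this CUDA-like style might look as follows (a hedged sketch: identifier names are illustrative, and the container handle is assumed to expose size() and get<>() inside the kernel):

```cpp
// Hedged sketch of a user kernel written in the CUDA-like style described
// above (identifier names are illustrative). The container handle is assumed
// to expose size() and get<>() inside the kernel; block size at most 256.
template <typename vector_type>
__global__ void scale_kernel(vector_type v, float alpha)
{
    __shared__ float cache[256];                     // static shared memory per block

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = (i < v.size()) ? v.template get<0>(i) : 0.0f;
    __syncthreads();                                 // block-wide barrier

    if (i < v.size())
        v.template get<0>(i) = alpha * cache[threadIdx.x];
}
```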

For the SEQUENTIAL backend, lightweight threads are created internally, and fast context switching is performed using the Boost library’s boost::context. Because lightweight threads are not concurrent, atomicAdd reduces to a regular addition. A block scan is implemented as a __syncthreads(), followed by the calculation of the exclusive prefix sum by thread zero of the block, and a final __syncthreads().
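The following is an illustrative sketch of this block-scan pattern (not the library implementation); because the lightweight threads are not concurrent, thread zero can safely compute the prefix sum of the whole block between the two barriers:

```cpp
// Illustrative sketch of the block-scan pattern described above (not the
// library implementation).
template <unsigned int BLOCK_SIZE>
__device__ int block_exclusive_scan(int my_value)
{
    __shared__ int vals[BLOCK_SIZE];

    vals[threadIdx.x] = my_value;
    __syncthreads();                  // every lightweight thread has stored its value

    if (threadIdx.x == 0)
    {
        int acc = 0;
        for (unsigned int j = 0; j < BLOCK_SIZE; ++j)
        {
            int tmp = vals[j];
            vals[j] = acc;            // exclusive prefix sum, computed in place
            acc += tmp;
        }
    }
    __syncthreads();                  // results are visible to all threads

    return vals[threadIdx.x];
}
```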

In the OpenMP backend, blockIdx and threadIdx are marked thread_local and therefore use thread-local storage (TLS), so that each OpenMP thread has an independent copy. Blocks are distributed across OpenMP threads, with all lightweight threads of a block executed by the same OpenMP thread. If blocks do not use __syncthreads(), the backend switches to non-lightweight threads to help vectorization.

To illustrate the similarity of the openfpm_data kernel programming interface with CUDA, List. 1.1 shows the first part (defining the shared memory and loading the fields) of the miniBUDE benchmark [5] implemented as an openfpm_data kernel that can run on both CPUs and GPUs.

Listing 1.1. Shared-memory definition and field loading in the miniBUDE benchmark kernel [5] implemented with openfpm_data.

For lambda-based computation, openfpm_data supports directly launching a lambda function similar to libraries like Kokkos [8], RAJA [1], and SYCL [6]. The blockIdx and threadIdx constants are passed to the function as arguments. This implies that TLS for the OpenMP backend is not required, because blockIdx and threadIdx are local function arguments rather than global variables.
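As a purely hypothetical illustration (launch_lambda is a placeholder name, not the actual openfpm_data entry point; v, alpha, block_size, and n_blocks are assumed to be defined elsewhere), a lambda-based scaling operation could look like this:

```cpp
// Hypothetical illustration only: "launch_lambda" is a placeholder name, not
// the actual openfpm_data API. blockIdx and threadIdx are passed to the body
// as arguments, so no thread-local storage is needed on the OpenMP backend.
auto body = [=] (unsigned int blockIdx_x, unsigned int threadIdx_x) mutable
{
    unsigned int i = blockIdx_x * block_size + threadIdx_x;
    if (i < v.size())
        v.template get<0>(i) *= alpha;
};

launch_lambda(n_blocks, block_size, body);   // placeholder launch call
```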

4 Benchmarks

Table 1. Hardware/compiler combinations considered for the benchmarks.

We profile the memory and compute performance of openfpm_data in micro-benchmarks, and we demonstrate the use of the library in a real-world application from computational fluid dynamics. All benchmarks are performed on the hardware and using the compilers listed in Table 1. Benchmarks for sparse data structures are available elsewhere [3]. We only benchmark the OpenMP (on CPUs), CUDA (on Nvidia GPUs), and HIP (on AMD GPUs) backends of openfpm_data; SEQUENTIAL is always slower and only intended for debugging or porting purposes. Each measurement is repeated several million times to compute means and standard deviations.

4.1 Memory Performance

We first analyze the memory performance using a micro-benchmark that moves data between aggregates/tuples containing scalars, vectors, and rank-two tensors. Because this benchmark is memory-bound, it assesses the memory performance portability of the openfpm_data aggregate/tuple data abstractions. We evaluate the results both absolutely and relatively. For the relative evaluation, we compare against a hand-tuned implementation in Kokkos [8] and a plain C++ array implementation. For the absolute evaluation, we compare the memory bandwidth achieved by openfpm_data with the synthetic benchmarks BabelStream (for POWER 9, ARM, and dual-socket x86_64) and pmbw, an optimized memory bandwidth benchmark written in assembly (for single-socket x86_64), with vendor-specific memory copy functions for the GPUs, and with the theoretical peak memory bandwidth reported in the data sheets.

We perform the benchmark on 67.1 million elements, each containing a scalar, two 2-vectors, and a rank-two tensor of size \(2 \times 2\). We repeat each benchmark both for reading and for writing. The write benchmark reads one value from component 0 of the first vector and copies it into component 1 of the first vector, the scalar, all four components of the \(2\times 2\) tensor, and all components of the second 2-vector. This requires a total of nine memory accesses (counted from the generated assembly code): 8 writes and 1 read. The read benchmark reads the values from the first 2-vector, the scalar, the tensor, and component 0 of the second vector, sums them, and writes the sum into component 1 of the second vector. This results in a total of 8 reads and 1 write. In this benchmark, we use lambda-based openfpm_data implementations compiled for the OpenMP backend on CPUs and for the CUDA/HIP backends on GPUs. Memory bandwidth is calculated as the total amount of data accessed divided by the runtime required to complete all accesses. The results are shown in Table 2.
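A hedged sketch of the write-benchmark body (the property names VEC1, VEC2, SCALAR, and TENSOR as well as the container handle data are illustrative placeholders):

```cpp
// Hedged sketch of the write micro-benchmark body described above (property
// names/indices and the container handle are illustrative placeholders):
// one read from the first 2-vector followed by eight writes, for every element i.
auto write_body = [=] (std::size_t i) mutable
{
    float val = data.template get<VEC1>(i)[0];       // 1 read

    data.template get<VEC1>(i)[1]      = val;        // 8 writes follow
    data.template get<SCALAR>(i)       = val;
    data.template get<TENSOR>(i)[0][0] = val;
    data.template get<TENSOR>(i)[0][1] = val;
    data.template get<TENSOR>(i)[1][0] = val;
    data.template get<TENSOR>(i)[1][1] = val;
    data.template get<VEC2>(i)[0]      = val;
    data.template get<VEC2>(i)[1]      = val;
};
```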

Table 2. Memory performance (read/write) on different hardware in Gigabytes/second (GB/s) for the same memory transfer micro-benchmark (see main text) implemented in openfpm_data, Kokkos, and plain C++ arrays, compared with the synthetic memory benchmarks described in the text and the vendor-provided memory bandwidth from the data sheet, where available. All synthetic benchmarks except pmbw (for Ryzen 3990X) and data sheets only report composite read/write bandwidth. For all measurements, the standard deviation is \(<3\%\) and therefore not shown.

On the x86_64 CPUs, the measured memory bandwidth when reading is significantly larger than when writing. This suggests a cache policy of type write_allocate rather than write_around: in write_allocate, a write to a memory location not in cache first fills a cache line from memory, and the line is eventually written back, causing twice the data transfer of a read. The GPUs appear to implement a write_through cache policy. On all platforms, the memory performance of openfpm_data is comparable to that of plain C++ arrays (Table 2). With the exception of the M1 and the POWER 9, the numbers also match the synthetic benchmarks, confirming that the memory-mirrored tuple abstractions of openfpm_data incur low performance overhead.

To confirm that memory layout reordering does not interfere with the optimization stages of the compilers tested, but indeed helps the compilers vectorize the code, we inspect the generated assembly code. An example for an SoA memory layout is shown in Fig. 3 for clang 13. The analysis shows that even when combining rank-two tensor components with vector components, the compiler is able to recognize the contiguity of the index accessed via the () operator and to generate AVX instructions without further hints. This shows that the openfpm_data abstractions do not interfere with the optimization stages of the compiler and that the thread model still allows for vectorization.

4.2 Compute Performance

In order to benchmark the compute performance of openfpm_data, we use the miniBUDE performance benchmark [5], which has previously been used to compare compute performance of programming models including OpenCL, Kokkos, CUDA, SYCL, OpenMP, and OpenACC. While this benchmark does not over-stress the data structures, it quantifies the performance portability of the algorithms provided by openfpm_data. We do so by running the miniBUDE CUDA benchmark kernel through openfpm_data’s kernel interface. The openfpm_data compute kernel remains the same across all benchmarks, but is compiled using different backends: CUDA on Nvidia GPUs, OpenMP on CPUs, and HIP on AMD GPUs.

Fig. 3. C++ code reading from a vector component and writing to a rank-two tensor. As seen from the assembly code generated by clang 13, both the reads and the writes are vectorized, processing 8 floats in one instruction. Then, the counter in the register %rax is incremented by 16 Bytes, the stop condition is checked, and the loop iterates to label be8. The numbers to the left of the vertical line indicate the percentages of profiling samples collected from each instruction.

In order to render the results reproducible and comparable across compilers, we manually enable DAZ (denormals are zero) and FTZ (flush to zero) on all hardware. This does not significantly affect the computed values, but prevents compilers from using different SIMD mask flags under different compilation options.
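On x86_64, for example, this can be done through the SSE control register; the following is shown for illustration, and other architectures and the GPU backends use their own mechanisms:

```cpp
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (DAZ)
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (FTZ)

// Illustration of enabling FTZ/DAZ on x86_64; the benchmark harness may set
// these flags differently on other architectures and on the GPUs.
void enable_ftz_daz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);           // FTZ: flush denormal results to zero
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);   // DAZ: treat denormal inputs as zero
}
```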

Table 3 reports the relative performance of the same openfpm_data code on different hardware compared with the respective best performer from the miniBUDE test suite, as indicated in the last column. Despite the fact that the openfpm_data kernel was not manually changed or tuned for the different hardware targets, it mostly performs on par with the specialized CUDA or OpenMP implementations of miniBUDE. The only exception is the RXVega 64, where OpenCL is faster than openfpm_data with the HIP backend. Code inspection shows that the two compilers produce different code: HIP produces code with fewer registers and higher occupancy, while OpenCL does the opposite. While it is counter-intuitive that this explains the performance difference, it is what the measurements show, and it possibly hints at latencies or GPU stalling as the problem for openfpm_data on the RXVega 64.

4.3 Application Example: Smoothed Particle Hydrodynamics

We demonstrate the use of openfpm_data in a typical real-world application from scientific computing: a computational fluid dynamics simulation using the numerical method of Smoothed Particle Hydrodynamics (SPH). As a baseline, we use the CPU-only implementation of SPH from the original OpenFPM paper [4], which is freely available in the OpenFPM repository, albeit without the CPU-specific manual optimizations (like Verlet lists and symmetric interactions). We derive from this code a version implemented using the CUDA-like kernel interface of openfpm_data and the built-in algorithmic primitives cell-list, sort, and scan.

Table 3. Performance of the same miniBUDE-like openfpm_data kernel on different hardware compared with the respective best performer of the miniBUDE benchmark [5] as given in the last column. Values are given as relative performance (GFlops openfpm_data)/(GFlops best miniBUDE) as mean ± standard deviation over 30 independent trials. Values >1 (in bold) mean that openfpm_data was faster than the fastest miniBUDE implementation.

We use both codes—the original MPI-only CPU code [4] and the code using openfpm_data kernels—to simulate the same “dam break” SPH test case [4].

Table 4 shows the measured relative performance of these two codes on different CPUs. Performance is reported as the runtime ratio (original MPI code)/(openfpm_data code) in percent for the OpenMP backend of openfpm_data. Therefore, numbers >100% (in bold) indicate speedup. The most expensive part of the simulation, the force calculation step, is also profiled separately.

Table 4. Performance of the openfpm_data SPH “dam break” simulation on different CPUs using all available cores, relative to the performance of the original MPI code [4] on the same CPUs (=100%). Numbers >100% (in bold) indicate speedups.

The results show that the openfpm_data abstraction layer adds no detectable performance penalty in this complex real-world application. That it is actually a few percent faster than the original MPI code is likely because the OpenMP backend has a lower communication overhead than MPI. The openfpm_data code also runs on GPUs. On an Nvidia A100, it runs 36 times faster than on all cores of an EPYC 7702 CPU, and on a RXVega 64 the speedup is 2.7. This difference in speedups is expected, as profiling shows the bottleneck for this application to be memory access and L2 cache. The Vega has slower memory than the A100 (484 GB/s vs. 1.5 TB/s) and 10x less L2 cache (4 MB vs. 40 MB). In addition, the Vega uses the older GCN architecture, which is known to be less efficient than AMD’s newer CDNA architecture.

5 Conclusions

We have presented and benchmarked a C++14 memory and compute abstraction library for scientific computing applications on CPUs and GPUs. The presented library, called openfpm_data, combines shared-memory data structures with reusable algorithmic building blocks. Compared to the state of the art, openfpm_data provides more flexible, tuple-based memory layouts, memory mirroring, and advanced data structures like cell lists, sparse grids, and graphs.

We have shown the benefits this brings for performance portability in both micro-benchmarks and a typical real-world numerical simulation application, comparing to the respective state of the art. The presented benchmarks have also shown that memory layout switching using memory-mirrored C++ tuples does not degrade performance and does not hamper compiler optimizations.

The algorithmic primitives provided by openfpm_data include n-dimensional convolution, merging, sorting, prefix sum, reduction, and scan. They are available in optimized implementations for CUDA, HIP, SEQUENTIAL, and OpenMP backends and can be used and extended in either a CUDA-like kernel programming interface or a lambda-based interface. This allows scientific codes to run on different hardware platforms without losing performance, as demonstrated in the SPH fluid-flow simulation example.

The abstract data structures provided by openfpm_data are composable and can be used as building blocks for more complex data structures, such as distributed sparse block grids [3], and for domain-specific data structures like those in OpenFPM [4]. The memory layout capabilities are inherited, as well as the memory mirroring capability, allowing the same data structure to simultaneously be mapped to host and device. Moreover, third-party libraries can be interfaced via external memory and pointer shaping.

The scalable scientific computing framework OpenFPM [4] is based on the openfpm_data abstraction layer presented here. The OpenFPM project composes the shared-memory openfpm_data abstractions into distributed-memory objects for multi-node and multi-GPU applications with transparent network communication. The portable openfpm_data data structures and kernels enable OpenFPM to transparently run simulation codes on multiple architectures.