
1 Introduction

For the past decades, the main trend has been to enhance the compute capabilities of processors through frequency increases, functional-unit extensions (e.g., SIMD) or core duplication. This trend has led to the current generation of supercomputer nodes, designed around several multi-core processors linked together. But this large spectrum of compute capabilities puts pressure on the memory system to keep such functional units fed. For example, SIMD operations may require more memory bandwidth, since a single instruction can consume a large number of register inputs. However, a new memory type with a larger bandwidth, at the same overall cost, exposes a smaller storage capacity. This is why various kinds of memory have appeared in the HPC community. For example, Intel launched the Knights Landing many-core processor  [26], which embeds a high-bandwidth stacked memory named MCDRAM. The next generation of this approach is called HBM and will be available in processors such as the ARM-based Fujitsu A64FX  [30]. This approach improves the throughput of bandwidth-hungry units, but there is still a need for large storage with lower bandwidth yet better performance than regular disks. This is why persistent memory, such as the flash-based NVDIMM technology  [15], is appearing in clusters.

While these new memory types may fulfill the requirements of many parallel applications (HBM/MCDRAM for compute-intensive parts, NVDIMM for I/O-dedicated portions), allocating data in these target memory spaces in a portable and easy way remains tedious. Thus, memory management is becoming a major concern for HPC application developers. Since version 5.0, OpenMP tackles this issue by introducing memory management extensions  [22]. It is now possible to control data allocation and placement on specific memory spaces through OpenMP constructs. For this purpose, OpenMP defines multiple memory spaces; application developers then specify some parameters (traits) for a specific allocator (e.g., data alignment, pool size, etc.).

This paper presents a first experience of implementing these OpenMP memory management constructs. It makes the following contributions:

  • A preliminary implementation of the OpenMP memory management constructs targeting DRAM, MCDRAM and NVDIMM in the MPC framework  [7] (Footnote 1), a thread-based MPI implementation with an OpenMP 3.0 runtime system;

  • The port of a C++ mini-application, including the effort needed to support STL objects;

  • Experiments on portability across various target architectures exposing different memory types.

This paper is organized as follows: Sect. 2 gives an overview of the OpenMP specification for memory management. Section 3 presents related work in this area. Section 4 details our approach to implementing these OpenMP constructs in the MPC framework and enabling their support inside an application. Finally, this paper illustrates experimental results in Sect. 5 before concluding in Sect. 6.

Listing 1.1. Allocator initialization and data allocation through the OpenMP API routines (listing not reproduced)

2 Memory Management in OpenMP 5.0

OpenMP 5.0 introduces constructs and API routines to manage portable data allocation in various memory banks. It defines a set of memory spaces (omp_memspace_handle_t) and parameters (traits) that affect the way data are allocated in the target memory (omp_alloctrait_t). Even though each implementation can propose its own spaces and traits, OpenMP 5.0 defines default spaces (default, large capacity, constant, high bandwidth and low latency) and allocator traits (such as alignment, pool size and fallback). The user has to create an allocator handle (omp_allocator_handle_t) by specifying a target memory space and a set of traits. This operation is performed by calling the initialization function named omp_init_allocator. The top part of Listing 1.1 highlights this process.

After this setup, the application can allocate data with this new allocator. The runtime is then responsible for allocating data in the target memory bank according to the specified trait values. There are two ways to manage data allocations with OpenMP: functions and directives. The first method is to call the functions omp_alloc and omp_free. Both take an allocator handle as input (which must be initialized first). Listing 1.1 shows this approach, and the sketch below illustrates the same pattern.
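As Listing 1.1 is not reproduced here, the following minimal sketch follows the OpenMP 5.0 API; the allocator name, chosen traits and array size are illustrative rather than taken from the original listing.

#include <omp.h>
#include <stdio.h>

int main(void) {
  /* Traits: 64-byte alignment, and fall back to the default memory
     space if no high-bandwidth memory is available on the node. */
  omp_alloctrait_t traits[2] = {
    { omp_atk_alignment, 64 },
    { omp_atk_fallback,  omp_atv_default_mem_fb }
  };

  /* Create an allocator handle targeting the high-bandwidth space. */
  omp_allocator_handle_t hbw_alloc =
      omp_init_allocator(omp_high_bw_mem_space, 2, traits);

  /* Allocate, use and release data through this allocator. */
  double *a = (double *) omp_alloc(1024 * sizeof(double), hbw_alloc);
  if (a != NULL) {
    a[0] = 42.0;
    printf("first element: %f\n", a[0]);
    omp_free(a, hbw_alloc);
  }

  omp_destroy_allocator(hbw_alloc);
  return 0;
}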

Listing 1.2. Data allocation through the allocate directive and clause (listing not reproduced)

The second method relies on the allocate directive and clause (see Listing 1.2 and the sketch below). The directive takes a list of variables to allocate through the handle specified in the allocator clause. The clause can also be used on several constructs such as task, taskloop and target; in both cases, users specify the allocator handle and the list of variables to allocate with it. The allocator can be user-defined or one of the predefined allocators: the OpenMP standard provides one predefined allocator per memory space.
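As Listing 1.2 is not reproduced, the sketch below shows both forms using predefined allocators from the specification; the variables are illustrative.

#include <omp.h>

void compute(void) {
  double x[1024];
  /* Directive form: the storage of x is provided by the predefined
     high-bandwidth allocator. */
  #pragma omp allocate(x) allocator(omp_high_bw_mem_alloc)

  /* Clause form: the private copy of x created for the task is
     allocated in the large-capacity memory space. */
  #pragma omp task firstprivate(x) allocate(omp_large_cap_mem_alloc: x)
  {
    x[0] = 1.0;  /* ... work on the private copy ... */
  }
  #pragma omp taskwait
}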

3 Related Work

This section details different approaches to deal with multiple memory levels inside an HPC compute node: dedicated allocations, portable allocations and OpenMP implementations.

Dedicated Memory Management. The first approach deals with dedicated interfaces to allocate data inside a specific target memory type. Even if some memory kinds can be configured as a cache level to enable automatic hardware-driven management (e.g., MCDRAM on the Intel KNL  [9, 18, 21]), fine-grain data allocation can lead to better performance. Thus, some research papers configure the MCDRAM in flat mode, meaning that an explicit action is required to put data into this target memory. This action may have a coarse-grain scope (e.g., relying on the numactl tool  [8] or forcing the global memory-placement policy  [19]) or a more fine-grain one (e.g., using memory allocators like memkind  [6] to handle placement on a per-allocation basis  [5]). Similar fine-grain initiatives also exist for other memory types, e.g., persistent memory  [2, 16, 25].

Portable Memory Management. Runtime systems are already able to deal with memory management for performance-portability concerns  [1, 4, 12]. This list is not exhaustive, as multiple approaches exist in this domain, especially for heterogeneous systems. Other initiatives also deal with high-bandwidth memory management, such as MCDRAM for KNL processors; this form of memory management has been explored within domain-specific languages in  [24].

High-level programming interfaces are widespread in the HPC community. Previous works such as  [10, 11] also have to deal with memory allocation in a portable way, but currently only for GPGPU concerns. These interfaces enable abstract memory allocations through wrapper functions or objects.

OpenMP Memory Management. While some initiatives have already been proposed to manage memory in a portable way, there is no standard way to do it. OpenMP now provides a way to standardize memory management for a wide spectrum of systems. Based on directives and functions, software developers can address memory allocations easily: they do not need to deal with the actual low-level allocation method for a specific memory. As far as we know, LLVM  [20] has the most up-to-date OpenMP runtime implementation regarding the support of memory management constructs. Even if this runtime system is well advanced, the front-end part of the supported compilers (Clang and Intel) does not cover the full specification. So far, it supports allocation in standard and high-bandwidth memory banks only, with a few traits (e.g., pool_size, fallback and alignment). For the high-bandwidth memory support, LLVM forwards memory allocations to hbw_malloc from the memkind library.

Paper Position: While our work is similar to the LLVM approach (design and implementation of memory-management constructs inside an existing OpenMP runtime), the objective of this paper is to give a preview of our implementation supporting multiple memory levels (DDR, MCDRAM/HBM and NVDIMM) and of its portability aspects. Moreover, we have experimented with integrating these OpenMP functions in a portable way into a C++ application.

4 Application- and Runtime-Level OpenMP Memory Management

This section presents the main contribution of this paper: the support of multiple memory types inside an existing OpenMP implementation (Sect. 4.1) and the port of a C++ mini-app (Sect. 4.2).

4.1 Runtime System Design for Memory Management Integration

MPC  [23] is a thread-based MPI implementation which provides an OpenMP  [7] runtime system. It is compliant with OpenMP 3.1 and partially supports version 4.5. As MPC integrates its own NUMA-aware allocator, we decided to design and implement the memory management constructs inside this framework. MPC is compatible with the GNU and Intel compilers for OpenMP lowering and thread-based specific features. However, these compilers currently have limited support for the allocate directive and clause. Thus, our work focuses on providing the initialization functions and allocation/deallocation calls (omp_alloc and omp_free) for multiple memory types: DDR, high-bandwidth MCDRAM and large-capacity NVDIMM. For this purpose, it is necessary to enhance the existing implementation with advanced hardware-topology detection and an approach for initializing allocators and allocating data. This section details those steps.

Automatic Memory Bank Discovery. OpenMP offers a set of predefined memory spaces with a way to fall back if the application tries to allocate in a memory type that does not exist. Thus, the first step is to discover the memory banks available on the target machine. For this purpose, a hardware-detection tool has to be integrated into the OpenMP runtime system to list the available memory spaces at execution time. Most runtime systems already rely on the hwloc library for hardware-topology discovery and thread binding. Recent work  [13] adds support for heterogeneous memory types such as high-bandwidth memory (e.g., MCDRAM on Intel's KNL processor) or large-capacity memory banks like the NVDIMM technology. On such machines, the MCDRAM memory space is exposed as a no-core NUMA node with a special attribute named MCDRAM. Checking for the presence of MCDRAM in a system is thus possible by browsing all the NUMA nodes and searching for one that carries this hwloc attribute. Large-capacity memory spaces such as NVDIMM are viewed by hwloc as OS devices and tagged with a special attribute, so it is possible to detect such memory banks by listing all the OS devices and searching for large-capacity ones.

The basic building block of our design is the hwloc-based hardware-detection module of the MPC framework. It is called at runtime initialization, and at every runtime entry point that is not inside a parallel region, to detect and save the available hardware components. A minimal sketch of the MCDRAM check described above is given below.
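This sketch assumes hwloc 2.x, where KNL's MCDRAM is reported as a memory-only NUMA node whose subtype string is "MCDRAM"; it is an illustration, not MPC's actual detection code.

#include <hwloc.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the topology exposes a NUMA node flagged as MCDRAM. */
static int topology_has_mcdram(hwloc_topology_t topo) {
  hwloc_obj_t node = NULL;
  while ((node = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_NUMANODE,
                                            node)) != NULL) {
    if (node->subtype && strcmp(node->subtype, "MCDRAM") == 0)
      return 1;
  }
  return 0;
}

int main(void) {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);
  printf("high-bandwidth memory: %s\n",
         topology_has_mcdram(topo) ? "available" : "not found");
  hwloc_topology_destroy(topo);
  return 0;
}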

Memory Management Initialization. The allocation process is separated into two parts: the initialization of allocators and the data allocation (as seen in Listing 1.1).

First of all, the omp_init_allocator function initializes the user-defined traits in an omp_alloctrait_t structure and links them to a memory space described by an omp_memspace_handle_t structure. From a runtime point of view, it is necessary to keep a dynamic collection of allocators. By default, in our implementation, this structure contains the predefined allocators only. Its size can then be enlarged so that users can create new allocators with specific traits. The structure maps an allocator handle (i.e., omp_allocator_handle_t) to an allocator structure (i.e., omp_alloc_t). This collection is only accessible from each thread in read-only mode: indeed, we do not ensure thread safety for this structure yet, so all allocators have to be initialized outside of parallel regions. We are currently working on thread safety to enable the creation of allocators inside parallel regions; in this way, concurrent threads will be able to use newly created allocators with the omp_alloc function inside parallel regions. A possible shape of this collection is sketched below.
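The collection could, for instance, be as simple as a fixed-size table indexed by the handle; the structure below is a hypothetical illustration under that assumption, not MPC's actual data structure.

#include <omp.h>
#include <stddef.h>

/* Hypothetical allocator descriptor; field names are illustrative. */
typedef struct {
  omp_memspace_handle_t memspace;  /* target memory space                     */
  size_t alignment;                /* value of the alignment trait            */
  int fallback;                    /* value of the fallback trait (omp_atv_*) */
} internal_allocator_t;

#define MAX_ALLOCATORS 64

/* Maps an allocator handle (used here as a plain index) to its descriptor.
   The first entries hold the predefined allocators; user-defined allocators
   are appended by omp_init_allocator, which must currently be called
   outside of parallel regions (the table is not yet thread-safe). */
static internal_allocator_t allocator_table[MAX_ALLOCATORS];
static int allocator_count;

static internal_allocator_t *lookup_allocator(omp_allocator_handle_t h) {
  size_t idx = (size_t) h;   /* the handle encodes the table index */
  return (idx < (size_t) allocator_count) ? &allocator_table[idx] : NULL;
}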

Various traits are proposed in the specification, such as memory alignment, as illustrated in Listing 1.1. Traits have default values, but the initialization process may set them to user-defined values. The MPC framework currently supports the alignment and fallback traits; future work is needed to support more traits.

Fig. 1. omp_alloc call procedure (figure not reproduced)

Data Allocation Process. The final step is to implement the data-allocation mechanism. Figure 1 sketches this process. First, the allocator is retrieved from the collection based on the handle. Depending on the trait values, the data allocation process differs: for example, we test the value of the fallback trait if the runtime detects that the requested memory space is unavailable. Trait checking is quite similar regardless of the selected memory space. Once the trait values are processed, we need to link the memspace to an allocation function, which differs according to the desired memory space. For example, malloc from the glibc allocates data in the DRAM memory space (denoted omp_default_mem_space) while hbw_malloc from the memkind library enables data allocation in the high-bandwidth memory space (denoted omp_high_bw_mem_space). To support the full set of memory spaces defined by the specification, runtime developers may have to integrate various allocator libraries into the OpenMP runtime. For example, data allocation in the high-bandwidth and large-capacity memory spaces is currently well supported by the memkind library (Footnote 2). The MPC framework comes with its own NUMA-aware allocator  [28] based on kernel page reuse. For convenience, and to avoid integrating multiple libraries, we link our allocator to the OpenMP runtime: data allocations in DRAM (the default memory space) and in the high-bandwidth memory space are thus handled by our allocator. As MCDRAM is currently detected as a no-core NUMA node, we redirect the corresponding dynamic allocation requests to this NUMA node. We also support data allocation in large-capacity devices like NVDIMM; for this purpose, we have integrated the nvmem library (Footnote 3) into the MPC framework. A sketch of this dispatch is shown below.
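As an illustration of the dispatch (not MPC's actual code, which routes DRAM and MCDRAM requests through its own NUMA-aware allocator), a memkind-based version could look as follows; the nvdimm_alloc helper is a placeholder standing in for the call into the nvmem library.

#include <omp.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's hbw_malloc: one possible HBM backend */

/* Placeholder for the large-capacity backend (e.g. the nvmem library);
   the actual call depends on that library's API. */
void *nvdimm_alloc(size_t size);

/* Sketch of the dispatch performed once the allocator descriptor has been
   retrieved from the collection and its traits have been checked. */
static void *dispatch_alloc(omp_memspace_handle_t space, size_t size) {
  if (space == omp_high_bw_mem_space)
    return hbw_malloc(size);     /* MCDRAM / HBM                   */
  if (space == omp_large_cap_mem_space)
    return nvdimm_alloc(size);   /* NVDIMM                         */
  return malloc(size);           /* omp_default_mem_space -> DRAM  */
}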

The deallocation process is quite similar to the allocation one: when omp_free is called, we check the memory space recorded in the allocator structure and call the appropriate deallocation function.

Listing 1.3. Custom C++ STL allocator encapsulating the OpenMP allocation routines (listing not reproduced)

4.2 Enabling Portable Application Memory Management

After implementing partial memory management support in the runtime, it is necessary to modify the target application as presented in Listing 1.1. This section illustrates the case of LULESH  [17], a hydrodynamics application from the CORAL benchmark suite. This example provides some valuable experience on how to port an existing C++ application to the OpenMP memory management functions. While porting C code leads to additional code as shown in Listing 1.1, many C++ applications exploit STL objects  [27] (e.g., vector, stack, list, ...). Such objects manage allocation through a template parameter Allocator. We propose a custom allocator object (see Listing 1.3 and the sketch below) that integrates the features of the OpenMP omp_allocator_handle_t structure. Thus, all the methods (e.g., constructor/destructor, resize or insert) use OpenMP allocation routines. This example also illustrates that the new allocator has to be passed as a template parameter of the STL object and as an input of the constructor (to indicate the right memory space to the allocator).
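As Listing 1.3 is not reproduced, the sketch below gives the general shape of such an allocator; the class and member names are illustrative and the actual listing may differ.

#include <omp.h>
#include <cstddef>
#include <vector>

// Minimal C++ allocator whose storage comes from an OpenMP allocator handle.
// (A production version should throw std::bad_alloc when omp_alloc returns NULL.)
template <class T>
struct omp_stl_allocator {
  using value_type = T;

  omp_allocator_handle_t handle;

  explicit omp_stl_allocator(omp_allocator_handle_t h = omp_default_mem_alloc)
      : handle(h) {}
  template <class U>
  omp_stl_allocator(const omp_stl_allocator<U> &o) : handle(o.handle) {}

  T *allocate(std::size_t n) {
    return static_cast<T *>(omp_alloc(n * sizeof(T), handle));
  }
  void deallocate(T *p, std::size_t) { omp_free(p, handle); }
};

template <class T, class U>
bool operator==(const omp_stl_allocator<T> &a, const omp_stl_allocator<U> &b) {
  return a.handle == b.handle;
}
template <class T, class U>
bool operator!=(const omp_stl_allocator<T> &a, const omp_stl_allocator<U> &b) {
  return !(a == b);
}

// Usage: a vector whose elements live in the high-bandwidth memory space.
void example() {
  std::vector<double, omp_stl_allocator<double>> v(
      omp_stl_allocator<double>(omp_high_bw_mem_alloc));
  v.resize(1000, 0.0);
}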

5 Experimental Results

This section evaluates our implementation inside the MPC framework on one benchmark allocating data in various memory banks through the OpenMP 5.0 memory-management functions. For this purpose, we modified the LULESH benchmark (as explained in Sect. 4.2) by inserting calls to the omp_alloc function to allocate data in various memory spaces.

Experimental Environment. On the hardware side, the target platforms cover the different memory kinds that our implementation supports; we rely on four different systems. The first one is a compute node containing a 68-core Intel Knights Landing processor  [26], 16 GB of MCDRAM and 96 GB of regular DRAM. This configuration is used to evaluate allocation in high-bandwidth memory. To test the large-capacity memory, we use a compute node equipped with two NVDIMM devices (with a capacity of 1.5 TB each). This persistent-memory node is composed of two 24-core Intel Cascade Lake processors (Xeon Platinum 8260L), each clocked at 2.40 GHz. Finally, two systems are available to check portability by exposing only regular DDR: one AMD Rome node (two 32-core AMD EPYC 7502 processors at 2.50 GHz with 128 GB of DDR) and one Intel Skylake node (two 24-core Intel Xeon Platinum 8168 processors at 2.70 GHz with 96 GB of DDR). On the software side, all benchmark versions were compiled with -O3 and linked to the MPC framework (configured with standard options) for the OpenMP runtime.

5.1 Coarse-Grain OpenMP Memory Allocation

This section describes the experiments conducted on the modified LULESH benchmark targeting a single memory space. It aims at testing the ability of our implementation to allocate data in a specified memory space following the OpenMP 5.0 standard.

Fig. 2. Coarse-grain data allocation management in MCDRAM (figure not reproduced)

High-Bandwidth Memory. The first evaluation concerns the MCDRAM memory on the Intel KNL node with 64 threads. Figure 2 shows the Figure of Merit (FOM, the number of elements solved per microsecond) according to the mesh size for different versions of LULESH (with a fixed total number of iterations: 100). The first bar (standard) represents the regular run (everything is allocated in DDR), while the second bar (numactl) is controlled by the numactl command, which places all data into the MCDRAM. The third version (omp_alloc) is modified with the OpenMP memory management constructs. All three executions were compiled and run with the MPC OpenMP implementation. These results show only a 5% performance difference between the numactl execution and the application modified to use omp_alloc, for problem sizes from 30 to 200. There are no results for problem sizes greater than 200 for the numactl version because the data no longer fit in MCDRAM and the application stops. From 200 to 350, however, the omp_alloc version can still execute even though the allocations do not fit into MCDRAM, and the performance diminishes. This is due to the cache memory mode: as the data do not fit in MCDRAM, application performance is bound by the DRAM bandwidth because some data are allocated in it. The performance of the original application without numactl is lower than that of the two other versions. In conclusion, we are able to allocate data in the MCDRAM high-bandwidth memory bank with the OpenMP memory management functions. The performance difference between the omp_alloc and numactl curves comes from the fact that numactl is much more aggressive and allocates all the data in MCDRAM, whereas our modified version only moves dynamically allocated data such as arrays and vectors, so not all the data end up in MCDRAM.

Fig. 3. Coarse-grain data allocation management in NVDIMM (figure not reproduced)

Large-Capacity Memory Space. Figure 3 presents the FOM of LULESH running on the dual-socket 24-core Cascade Lake node (i.e., 48 OpenMP threads) equipped with NVDIMM devices. While both versions rely on OpenMP to allocate data, the first one (RAM) allocates all data in regular DDR (default allocator) while the second one (NVDIMM) changes the OpenMP allocator to target the large-capacity space. With minor modifications, this graph shows the ability to perform data allocation in large-capacity memory devices like the NVDIMM technology. As explained in  [14], the NVDIMM memory can be configured in two modes. Our results are similar for both selected memory spaces; we can conclude that the node is configured in 2LM mode (i.e., similar to the KNL cache memory mode): all data allocations are thus directed to the NVDIMM memory. Since no error message is emitted by the vmem library, we are confident that the requested memory is indeed allocated in NVRAM.

5.2 Fine-Grain OpenMP Memory Allocation

A previous analysis  [3] of the LULESH benchmark has already determined the relevant data to be placed in the high-bandwidth memory bank. The purpose of that work was to detect which functions are bandwidth-bound: application performance can be improved if the data operated on by these functions are placed in memory banks with a higher bandwidth. This analysis demonstrated that some functions, such as EvalEOSForElems, AllocateGradients and CalcForceNodes, are sensitive to bandwidth. We thus placed all the data related to these functions in omp_high_bw_mem_space (i.e., MCDRAM here) while the other data are placed in omp_default_mem_space (i.e., DRAM memory), following the pattern sketched below.
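The selection amounts to choosing an allocator per data structure; the variable names below are placeholders rather than the actual LULESH fields.

#include <omp.h>
#include <cstddef>

// Bandwidth-critical arrays (used by the bandwidth-bound kernels) go to the
// high-bandwidth space; everything else stays in the default (DRAM) space.
void allocate_domain(std::size_t n, double **gradients, double **scratch) {
  *gradients = static_cast<double *>(
      omp_alloc(n * sizeof(double), omp_high_bw_mem_alloc));
  *scratch = static_cast<double *>(
      omp_alloc(n * sizeof(double), omp_default_mem_alloc));
}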

Fig. 4. Fine-grain data allocation management in MCDRAM (figure not reproduced)

The objective of this experiment is to illustrate the benefit of fine-grain management of data allocation, i.e., choosing in which memory bank to allocate through the omp_alloc function. Figure 4 presents the results of this approach (data selection) compared to allocating everything in MCDRAM (full mcdram). For this experiment, we raise the number of iterations to 250 compared to the previous experiment, while still varying the problem size. We can observe that allocating all data in MCDRAM is generally about 10% better for problem sizes between 75 and 250. Beyond a problem size of 250, however, its performance is about 50% lower than that of the data selection method; below a problem size of 75, both versions perform about the same. Past that point, the performance of the full-MCDRAM version keeps decreasing and becomes worse than the standard allocation model. These results show that, although data selection may lose a bit of performance at small scale, it becomes significantly important at large scale, where many HPC applications run. For all the tested configurations of the data selection version, we observe no performance decrease; however, its performance seems to be bounded around a FOM of 6500.

In conclusion, as previously stated in several papers such as  [5], we show that allocating all data in MCDRAM might not be the most relevant choice. Indeed, when the data do not fit inside this high-bandwidth memory bank, application performance deteriorates, even more than without the use of MCDRAM. With this experiment, we aim to warn developers about data allocation: OpenMP offers ways to easily allocate data in various memory banks in a portable way, but the strategy for selecting which data to move from one memory bank to another remains the developer's responsibility. Currently, no runtime mechanism exists to automatically move data between memory banks.

5.3 Portability Across Hardware Platforms

The design of the OpenMP memory-management constructs enables application portability regardless of the memory types available on the target hardware. With the help of the fallback trait, data can be allocated in a default memory space (default_mem_fb) if the specified one does not exist; however, an application can also terminate if the fallback property is set to abort_fb. We propose here an experiment that highlights the portability of our OpenMP memory-management implementation. For this purpose, we ran the LULESH benchmark with the fine-grain data allocation strategy sketched in the previous section (selected functions allocate in the high-bandwidth memory space while the others target the default allocator). We executed it on several platforms without any code modification. The selected machines are the ones based on AMD EPYC, Intel Skylake and Intel KNL processors described in Sect. 5. All runs were performed with 48 threads, and we fixed the problem size to 350 for 100 iterations. We set the fallback trait to default_mem_fb to forward data allocations to the DRAM memory space if no high-bandwidth memory is found, as in the sketch below. Results are gathered in Table 1.
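The fallback setup used for this experiment corresponds to the following sketch (the function name is illustrative).

#include <omp.h>

/* Request the high-bandwidth space, but fall back to the default memory
   space when the target machine does not expose such memory. */
omp_allocator_handle_t make_portable_hbw_allocator(void) {
  omp_alloctrait_t traits[1] = {
    { omp_atk_fallback, omp_atv_default_mem_fb }
  };
  return omp_init_allocator(omp_high_bw_mem_space, 1, traits);
}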

Table 1. FOM results (table not reproduced)

The execution on the KNL achieves better performance than on the two other machines. Indeed, as the KNL platform features MCDRAM, the selected data allocations are directed to high-bandwidth memory through the OpenMP constructs. The two other machines do not expose the MCDRAM memory type, so all data allocations are forwarded to classical DRAM without aborting the application. We conclude that we are able to ensure portability with the OpenMP memory management constructs. The main advantage of this approach is that this result is achieved without any significant modification to the application, while maintaining portability.

6 Conclusion and Future Work

This paper presents our experience with the OpenMP memory management constructs at the application level and the runtime level. On the application side, developers should integrate data allocation calls through a standard like OpenMP to obtain portability. Through the LULESH benchmark, we have illustrated that these new constructs are easy to integrate; however, users of C++ STL objects have to replace the default allocator and implement a new one that encapsulates the OpenMP function calls. We have also implemented these constructs in the OpenMP runtime of the MPC framework. While we do not support the full specification yet, we have implemented the major building blocks in an OpenMP runtime system targeting various memory levels (DDR, MCDRAM and NVDIMM). For this purpose, we detailed our implementation from hardware detection to the data allocation process.

Our results show that this implementation is feasible and that it can provide performance improvements for user applications. We illustrate that portable applications can be obtained with slight modifications. Our implementation is able to allocate data in the default, high-bandwidth and large-capacity memory spaces. Our experiments also show that data allocations should be performed with care: the best strategy is not always to allocate all data in the fastest memory.

As future work, we plan to support all the features provided by the specification, especially the remaining traits. We also plan to work on coupling the OpenMP memory management constructs with the affinity clause. Since version 5.0, the OpenMP specification introduces an affinity clause on task directives to give hints at scheduling time in order to enhance data locality. This clause has already been implemented and evaluated by others  [29]. The runtime will need information from the allocators for the affinity clause: this information can help the task scheduler make smarter decisions about affinity. In addition, keeping this information has a low memory footprint and can significantly improve application performance. We thus plan to evaluate this coupling to enhance the task scheduler.