1 Introduction

Modern High Performance Computing (HPC) has entered an era of extreme heterogeneity in which an increasing number of accelerators support SIMD parallelism, spatial computing, or domain-specific architectures. This is especially true as we move toward exascale, where the majority of pre-exascale and exascale systems are accelerator-based. For example, 7 of the top 10 systems in the Top500 list for November 2021 are GPU-based [1]. Until recently, NVIDIA GPUs were the dominant accelerators targeted by applications, but several next-generation systems will be based on accelerators from other vendors: Aurora and SuperMUC-NG Phase II with Intel GPUs [2, 3], and Frontier, El Capitan, and LUMI with AMD GPUs [4, 5, 6]. Each vendor generally develops its own programming model and implementation optimized for its hardware. This design poses a challenge for application developers who wish to create portable code for multiple systems. This programming-model heterogeneity often forces application developers to maintain separate code branches in each vendor-specific programming model, which increases code complexity and developer effort.

Heterogeneous-compute Interface for Portability (HIP) from AMD is one example of such a programming system; it targets AMD and NVIDIA architectures. In this paper, we introduce HIPLZ: a compilation and runtime system that supports HIP on top of Intel’s Level Zero (L0) runtime [7], using the fat binary model to support multiple architectures and SPIR-V as an intermediate language (IL). To the best of our knowledge, HIPLZ is the first effort that bridges HIP to L0, which is the primary low-level application programming interface (API) for Intel hardware.

In this paper, we present the following contributions:

  1. The prototype of HIPLZ, a library that allows applications using the HIP API to run on devices that support Intel Level Zero and OpenCL drivers. The source code is located at https://github.com/jz10/anl-gt-gpu.

  2. A test suite that covers the major functionality of HIP and serves as the validation suite for HIPLZ.

  3. An evaluation of test coverage and code performance of HIPLZ on Intel Gen9 GPUs. Our results show that HIPLZ supports the complete execution of 82% of tested applications and demonstrates performance parity with HIPCL and OpenCL for memory- and FLOP-focused benchmarks.

The paper is organized as follows: Sect. 2 gives background information about the HIP programming model, the intermediate representation, and the Intel L0 runtime. The details of the design and implementation are presented in Sect. 3. Section 4 discusses the testing of HIPLZ and evaluates its performance. Section 5 discusses related work, and Sect. 6 concludes.

2 Background

2.1 Heterogeneous-compute Interface for Portability (HIP)

HIP [8] is a C++14 runtime API and kernel language derived from CUDA [9] that allows developers to create portable applications for AMD and NVIDIA GPUs from a single source code. It supports advanced C++ language features, including templates, C++11 lambdas, and much more.
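As a minimal illustration of the model (our own example, not taken from the HIP documentation; error checking omitted), a HIP vector addition looks like the following:

#include <hip/hip_runtime.h>
#include <vector>

// A __global__ kernel is callable from the host and executes on the device.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

int main() {
  const int n = 1 << 20;
  std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);
  float *da, *db, *dc;
  hipMalloc(&da, n * sizeof(float));
  hipMalloc(&db, n * sizeof(float));
  hipMalloc(&dc, n * sizeof(float));
  hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
  hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

  // Launch on the default stream; hipLaunchKernelGGL is equivalent to the
  // CUDA-style triple-chevron launch syntax.
  hipLaunchKernelGGL(vector_add, dim3((n + 255) / 256), dim3(256), 0, 0,
                     da, db, dc, n);

  hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
  hipFree(da); hipFree(db); hipFree(dc);
  return 0;
}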

2.2 Standard Portable Intermediate Representation (SPIR-V) and Fat Binary

SPIR-V [10] is an open industry-standard intermediate language (IL) for shader and kernel compilers, used for expressing parallel computation and GPU-based graphics. SPIR-V gives language front-end compilers a common target, allowing developers to build compute kernels without directly exposing source code. This IL allows shipping compiled kernels in a binary format that remains portable across multiple hardware implementations.

The fat binary model integrates device code (kernel functions) into the host-side executable binary via an intermediate language, and uses vendor APIs (the driver compiler) to just-in-time compile the kernel functions at runtime. SPIR-V and NVIDIA PTX are typical intermediate languages used in fat binaries.

2.3 OpenCL and HIPCL

OpenCL [11] is a widely used, open standard for programming heterogeneous platforms, and is supported by most of the major accelerator vendors, including NVIDIA, AMD, Xilinx, ARM, and Intel.

HIPCL [12] is an open-source compilation and runtime system that allows running HIP programs on OpenCL platforms with sufficient capabilities. HIPCL relies on SPIR-V as a target IL (i.e., a fat binary embedded in the ELF binary) and implements the HIP API on top of OpenCL calls.

2.4 Level Zero Runtime

Intel Level Zero (L0) [7] is a specification that is part of Intel oneAPI, a SYCL-based specification and set of APIs and tools targeting CPU, GPU, and FPGA devices. The Intel L0 implementation provides direct-to-metal access to accelerator devices and brings flexibility through the support of a broad set of language features, e.g., unified shared memory, synchronization primitives, and device function pointers. The aim of the L0 API is to provide a system-level programming interface that allows higher-level runtime APIs and libraries to easily target heterogeneous hardware; this is why we selected it for HIPLZ. The features of the L0 API include, but are not limited to: device partitioning, instrumentation, debugging, power management, frequency control, and hardware diagnostics. The L0 specification does not define a kernel language, but relies on SPIR-V as an IL.
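For orientation, a minimal host-side sketch of L0 driver, device, and context discovery (our own example, not HIPLZ code; error handling omitted) looks like this:

#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

int main() {
  // Initialize the loader and discover the available drivers.
  zeInit(0);

  uint32_t driverCount = 0;
  zeDriverGet(&driverCount, nullptr);
  std::vector<ze_driver_handle_t> drivers(driverCount);
  zeDriverGet(&driverCount, drivers.data());

  // Enumerate the devices exposed by the first driver.
  uint32_t deviceCount = 0;
  zeDeviceGet(drivers[0], &deviceCount, nullptr);
  std::vector<ze_device_handle_t> devices(deviceCount);
  zeDeviceGet(drivers[0], &deviceCount, devices.data());

  // A context manages memory, modules, events, command lists, and queues.
  ze_context_desc_t ctxDesc = {ZE_STRUCTURE_TYPE_CONTEXT_DESC, nullptr, 0};
  ze_context_handle_t context = nullptr;
  zeContextCreate(drivers[0], &ctxDesc, &context);

  // ... create command lists, modules, and kernels against this context ...

  zeContextDestroy(context);
  return 0;
}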

3 Design and Implementation

3.1 Design Goal

The main design goal of HIPLZ is to connect the Intel L0 runtime to the HIP programming model, thus enabling applications written in HIP to run on GPU devices driven by L0. Based on a survey of HPC application needs, we focused on supporting the following HIP features in HIPLZ: i) streams, including command execution and callbacks (Sect. 3.4); ii) memory management, including host, device, shared, and texture memory (Sect. 3.5); iii) kernel and module management (Sect. 3.6); iv) device management (Sect. 3.7); and v) interoperation with other parallel programming systems such as Intel’s DPC++ (Sect. 3.8). We implemented 133 of the 144 HIP functions that existed at the time HIPLZ was written. HIP now has 343 functions; the functions not covered by HIPLZ are mainly for graph operations.

Fig. 1. The compilation workflow for HIPLZ.

Fig. 2. The organization of the Intel Level Zero runtime.

3.2 The Compilation System

The workflow for the compilation of a HIP program by HIPLZ is shown in Fig. 1. The HIPLZ compilation workflow is based on that of HIPCL, which is a HIP-compatible compiler frontend based on the LLVM/Clang compiler. The HIPLZ compiler translates HIP source code to two parts of LLVM intermediate representation: host IR and device IR. The host part is processed via the legacy LLVM x86 backend to produce an x86 binary, and the device part is processed via the LLVM SPIR-V backend to produce SPIR-V IR. The x86 binary and the SPIR-V IR are then linked together to make an x86 executable binary (or shared library) that is embedded with SPIR-V (a fat binary).

3.3 Runtime System

Before getting into the details of HIP feature support, we introduce the basic structure of L0. Figure 2 presents the organization of L0 APIs and objects in a top-down manner. At the top level, each Driver interacts with a collection of heterogeneous computing devices that share a given software stack. A physical device is presented as a Device associated with a Context that provides an interface for managing memory, modules, synchronization objects, command lists, and command queues. L0’s memory management covers host memory, device memory, shared memory, and image samplers.

The L0 API is very similar to OpenCL’s, especially in terms of the device data abstraction, execution model, and event-driven synchronization. However, L0 is at a lower level, and many features that are available in OpenCL are left to the application developer to implement. Such features include (but are not limited to) reference counting to handle object lifetimes, callbacks on event state changes, and host kernel enqueuing. HIPLZ wraps L0 data structures in C++ classes in an object-oriented manner, similar to OpenCL’s C++ bindings.

Table 1 details the mapping of data structures and similar objects for the programming models used in the next sections: HIP, HIPLZ, L0, OpenCL, and SYCL. The HIPLZ compiler translates each HIP object to its corresponding HIPLZ data structure, as shown in the first two columns of Table 1.

Table 1. The mapping among HIP, HIPLZ, L0, OpenCL, and SYCL objects.

3.4 Streams

A stream in HIP is represented as a sequence of tasks (e.g., kernels, memory copies, events) that execute in FIFO order. Tasks executed in different streams are allowed to overlap and share device resources, and different streams may execute their commands out of order with respect to one another or concurrently. Three types of streams exist in HIP: the default stream (or NULL stream), blocking streams, and non-blocking streams. The last two types can be created by the application programmer and differ in how they synchronize with the default stream. Tasks in the default stream wait for all tasks previously submitted to blocking streams to complete before executing. Similarly, tasks in blocking streams wait for all tasks previously submitted to the default stream to complete before executing. Non-blocking streams do not synchronize with the default stream.
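A small sketch of the three kinds of streams (our own example; error checking omitted):

#include <hip/hip_runtime.h>

// A trivial placeholder kernel.
__global__ void work(float* p) { p[threadIdx.x] += 1.0f; }

void submit(float* a, float* b, float* c) {
  hipStream_t blocking, nonBlocking;
  // A stream created with default flags synchronizes with the NULL stream.
  hipStreamCreate(&blocking);
  // A non-blocking stream does not synchronize with the NULL stream.
  hipStreamCreateWithFlags(&nonBlocking, hipStreamNonBlocking);

  hipLaunchKernelGGL(work, dim3(1), dim3(64), 0, 0, a);           // default (NULL) stream
  hipLaunchKernelGGL(work, dim3(1), dim3(64), 0, blocking, b);    // waits on prior NULL-stream work
  hipLaunchKernelGGL(work, dim3(1), dim3(64), 0, nonBlocking, c); // may overlap with both

  hipStreamSynchronize(blocking);
  hipStreamSynchronize(nonBlocking);
  hipStreamDestroy(blocking);
  hipStreamDestroy(nonBlocking);
}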

Fig. 3. a. The basic HIPLZ Command List and Command Queue (image source: https://spec.oneapi.io/level-zero/latest/core/INTRO.html); b. The event order for executing a callback.

To implement HIP streams, L0 offers two possible modes of execution for dispatching tasks to a device: 1) a command buffer abstraction (named a command list) that aggregates a series of tasks and can later be submitted to a command queue, where the driver is free to optimize the execution of the command lists based on the synchronization expressed by the programmer; and 2) a low-latency dispatch (named an immediate command list) that executes tasks as soon as they are ready (dependencies met) and able to be executed (resources available).

In HIPLZ, streams are implemented via LZQueue objects that wrap L0’s immediate command lists (see Fig. 3(a)). This mode of execution is better suited to implementing the FIFO behavior of HIP streams. Synchronization considerations are still important to ensure barriers between tasks within streams, as well as to correctly implement the HIP default stream semantics and synchronization. Nonetheless, using immediate command lists greatly reduces the overhead of managing individual command lists that would otherwise need to be submitted to command queues and then freed or recycled once the tasks they contain finish executing. This technique eliminates the need for dedicated event tracking for each command list, irrespective of synchronization with other streams, and it also reduces the latency between task submission and execution.
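A simplified sketch of the idea (not HIPLZ's actual LZQueue implementation; kernel creation and error handling omitted):

#include <level_zero/ze_api.h>

// A simplified stand-in for HIPLZ's LZQueue: tasks appended to an immediate
// command list are dispatched as soon as their dependencies are satisfied,
// which matches the FIFO semantics of a HIP stream.
class StreamLike {
public:
  StreamLike(ze_context_handle_t ctx, ze_device_handle_t dev) {
    ze_command_queue_desc_t qdesc = {};
    qdesc.stype = ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC;
    qdesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
    zeCommandListCreateImmediate(ctx, dev, &qdesc, &cmdList_);
  }

  // Enqueue a kernel; it executes in FIFO order on this "stream".
  void launch(ze_kernel_handle_t kernel, ze_group_count_t groups) {
    zeCommandListAppendLaunchKernel(cmdList_, kernel, &groups,
                                    nullptr, 0, nullptr);
  }

  ~StreamLike() { zeCommandListDestroy(cmdList_); }

private:
  ze_command_list_handle_t cmdList_ = nullptr;
};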

The commands executed by the streams include: kernel functions, memory copy operations, host callbacks, and HIP event operations. The synchronization among different streams is supported via L0 events and their wait and signal APIs. The event object in a command list acts as either a barrier or signal, so two tasks running on different streams can use events to synchronize their executions.

Stream Synchronization Example: We use the host callback implementation as an example of how synchronization between and within streams is implemented in HIPLZ with L0. Figure 3(b) presents the workflow of host callback registration and invocation in HIPLZ. The callback function pointer is registered via the callback registration API, and a synchronization scheme based on L0 events is set up to drive the callback. This implementation of host callbacks requires a three-point synchronization scheme. For each callback, three L0 events are created, here called reg event, exec event, and final event. Three synchronization primitives are added to the L0 immediate command list: a barrier that signals the reg event once it is reached, a barrier that waits for the exec event to be signaled by the host, and lastly a signal on the final event indicating that the synchronization is complete and the events can be freed (or recycled). In parallel, the (host) event monitor thread waits for the reg event to be signaled, executes the callback, signals callback termination via the exec event, and waits on the final event before releasing the resources.
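The sketch below illustrates this scheme with the corresponding L0 calls (event-pool creation and error handling omitted; the function names are ours, not HIPLZ's):

#include <level_zero/ze_api.h>
#include <cstdint>

// Stream side: append the three synchronization points to the immediate
// command list of the stream that owns the callback.
void append_callback_sync(ze_command_list_handle_t cmdList,
                          ze_event_handle_t regEvent,
                          ze_event_handle_t execEvent,
                          ze_event_handle_t finalEvent) {
  // 1. Barrier that signals regEvent once all previously submitted tasks complete.
  zeCommandListAppendBarrier(cmdList, regEvent, 0, nullptr);
  // 2. Barrier that waits until the host signals execEvent.
  zeCommandListAppendBarrier(cmdList, nullptr, 1, &execEvent);
  // 3. Signal finalEvent so the three events can be freed or recycled.
  zeCommandListAppendSignalEvent(cmdList, finalEvent);
}

// Host side: executed by the event monitor thread.
void run_callback(void (*callback)(void*), void* userData,
                  ze_event_handle_t regEvent,
                  ze_event_handle_t execEvent,
                  ze_event_handle_t finalEvent) {
  zeEventHostSynchronize(regEvent, UINT64_MAX);    // wait until the stream reaches the barrier
  callback(userData);                              // run the registered host callback
  zeEventHostSignal(execEvent);                    // release the waiting stream
  zeEventHostSynchronize(finalEvent, UINT64_MAX);  // now safe to release the events
}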

3.5 Memory Management

HIPLZ supports several HIP memory management APIs, including hipMalloc, hipMemcpy, hipMemcpyAsync, and hipFree. Users can specify the allocation site, i.e., host memory, device memory, or shared memory. Shared memory depends on the underlying GPU’s support, and its reference is presented as a raw pointer that can be dereferenced on both the host and the device side. As mentioned in Sect. 3.4, in L0 a memory copy operation is implemented as a command that is queued on the command list and executed via the command queue.
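A small usage sketch of these APIs (our own example; error checking omitted):

#include <hip/hip_runtime.h>

void copy_example(hipStream_t stream, const float* hostSrc, size_t n) {
  // Device allocation plus an asynchronous host-to-device copy ordered on a stream.
  float* dev = nullptr;
  hipMalloc(&dev, n * sizeof(float));
  hipMemcpyAsync(dev, hostSrc, n * sizeof(float),
                 hipMemcpyHostToDevice, stream);

  // Shared (unified) memory: the same raw pointer is valid on host and device,
  // subject to the underlying GPU's support.
  float* shared = nullptr;
  hipMallocManaged(&shared, n * sizeof(float));
  shared[0] = 1.0f;  // host-side access through the shared pointer

  hipStreamSynchronize(stream);
  hipFree(dev);
  hipFree(shared);
}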

HIP texture objects are special memory objects, and their support is similar to that of texture objects in CUDA; that is, a texture object is a first-class C++ object and can be passed as an argument just as if it were a pointer. HIPLZ provides hipCreateTextureObject and hipDestroyTextureObject to allocate and free texture objects.

A texture object is composed of an image buffer and a sampler object that operates on the image buffer. Since the image and sampler are defined as separate objects in L0 (i.e. ze_image_t and ze_sampler_t), we create the texture object as a C struct, as shown in Listing 1.1.

The ze_image_t and ze_sampler_t created via the L0 API are raw pointer values, so they can be stored as intptr_t values. The actual texture operations are performed on reinterpreted structure fields, as shown in lines 6–9 of Listing 1.1, where a two-dimensional texture of floating-point values is sampled at coordinates x and y. This scheme relies on implementation-specific behavior of the Intel driver compiler.

Listing 1.1. The texture object structure and sampling code used by HIPLZ.
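A minimal host-side sketch of this idea, with hypothetical struct, field, and function names (the layout in Listing 1.1 may differ), is:

#include <level_zero/ze_api.h>
#include <cstdint>

// Hypothetical struct; the exact layout in Listing 1.1 may differ.
// The L0 image and sampler handles (ze_image_handle_t, ze_sampler_handle_t)
// are opaque pointers, so they can be carried as integer-sized fields and
// reinterpreted on the device side by the driver compiler.
struct TextureObject {
  intptr_t image;    // holds a ze_image_handle_t
  intptr_t sampler;  // holds a ze_sampler_handle_t
};

// Host side: package freshly created L0 handles into the struct that is
// later passed to the kernel as an ordinary argument.
inline TextureObject make_texture(ze_image_handle_t img,
                                  ze_sampler_handle_t smp) {
  TextureObject t;
  t.image = reinterpret_cast<intptr_t>(img);
  t.sampler = reinterpret_cast<intptr_t>(smp);
  return t;
}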

3.6 Kernel and Module Management

HIP defines three different attributes for functions: __host__, __device__, and __global__. A __host__-decorated function is executed on the host, and functions without decorators are considered host functions. A __device__ function is callable from the device, and this decorator can be combined with __host__ to obtain a function that can execute on both the device and the host. A __global__-decorated function, or kernel, is callable from the host. The HIPLZ compiler translates kernel and device functions to SPIR-V IL, which is translated to a device binary by the vendor compiler at runtime. Each kernel function is wrapped in an LZKernel object and managed by an LZProgram object that represents the L0 module. Kernel launches are based on the L0 API and issue a command to the immediate command list.

HIPLZ also supports device global variables that are used for exchanging values between kernels and host code. Device global variables are supported in SPIR-V, and they can be interacted with from the host using L0. They can also be supported in OpenCL using Intel extensions.
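A short sketch of the usual HIP usage pattern for such a variable (our own example; hipMemcpyToSymbol is the standard HIP API for this, though its exact coverage in HIPLZ is not detailed here):

#include <hip/hip_runtime.h>

// A device global variable shared between kernels and host code.
__device__ int threshold;

__global__ void filter(const int* in, int* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = (in[i] > threshold) ? in[i] : 0;
}

void set_threshold(int value) {
  // Copy the host value into the device global variable.
  hipMemcpyToSymbol(HIP_SYMBOL(threshold), &value, sizeof(int));
}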

3.7 Device Management

Device management in HIPLZ focuses on device selection (hipSetDevice and hipGetDevice) and device property queries (i.e., hipGetDeviceProperties). From the L0 standpoint, this means creating a single L0 context containing all the devices and exposing those devices through hipGetDeviceCount. This allows sharing memory between devices using USM without needing to register USM allocations across different contexts. Setting the current active device in HIPLZ changes the default device and default stream. HIP device properties are derived from the corresponding device properties available in L0.
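For example (our own sketch; error checking omitted):

#include <hip/hip_runtime.h>
#include <cstdio>

void pick_device(int preferred) {
  int count = 0;
  hipGetDeviceCount(&count);  // all devices in the single L0 context
  if (count == 0) return;

  int device = (preferred >= 0 && preferred < count) ? preferred : 0;
  hipSetDevice(device);       // becomes the default device (and default stream owner)

  hipDeviceProp_t props;
  hipGetDeviceProperties(&props, device);  // derived from the L0 device properties
  std::printf("Using %s with %zu bytes of global memory\n",
              props.name, (size_t)props.totalGlobalMem);
}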

3.8 SYCL Inter-operation

Interoperability between SYCL and HIP helps users maintain large heterogeneous code bases, and it also allows leveraging high-performance libraries built by vendors (e.g., Intel oneMKL [13]). Both HIPLZ and DPC++ use L0 as the runtime driver for executing kernel functions on Intel GPUs, and use L0’s driver object handles to maintain and exchange GPU device information, e.g., to pass an execution context object from HIPLZ to DPC++. To support data exchange, the unified shared memory (USM) mechanism is employed. Both HIPLZ and DPC++ use raw pointers to reference memory allocated from USM, which simplifies passing memory references between objects in each execution context.


3.9 Kernel Library

The implementation of the HIP math API in HIPLZ is based on OCML [14], which is a thin layer wrapping the OpenCL builtin math functions.

3.10 Discussion

Implementing HIP with L0 comes with some challenges:

Program interface: The L0 API organization is very similar to OpenCL’s, especially for the objects that abstract the GPU device. However, L0 is a lower-level API than OpenCL: it lacks a kernel language and object lifetime management, and it requires finer-grained control over tasks using queues and command lists. This requires careful management of object lifetimes in HIPLZ and more involved synchronization schemes than in HIPCL.

Capacity of Conversion: Users could benefit from a conversion guide that would describe the potential pitfalls that can arise from migrating to the L0 API from other heterogeneous programming models.

Lack of Thread Safety: Many runtime objects and APIs are not thread-safe according to the L0 specification, so mutual exclusion is enforced at all relevant API call sites in HIPLZ using mutexes.
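A simplified sketch of this pattern (not HIPLZ's actual code):

#include <level_zero/ze_api.h>
#include <mutex>

// The L0 specification does not guarantee thread safety for many objects, so
// every call site touching a shared command list is serialized with a mutex.
class GuardedList {
public:
  explicit GuardedList(ze_command_list_handle_t list) : list_(list) {}

  void signal(ze_event_handle_t event) {
    std::lock_guard<std::mutex> lock(mtx_);
    zeCommandListAppendSignalEvent(list_, event);
  }

private:
  std::mutex mtx_;
  ze_command_list_handle_t list_;
};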

4 Evaluation

4.1 Employed GPU System

In this study we evaluated HIPLZ on an Intel Gen9 GPU [15] in the JLSE cluster [16]. The Gen9 is an integrated GPU available in commercial Intel products such as laptops. Although Intel plans to release high-performance discrete GPUs [2], these were not publicly available at the time of writing, so we focus on the Gen9 GPU.

The Gen9 GPU has a peak theoretical double (single) precision performance of 331.2 GFlop/s (1324.8 GFlop/s). With 2 channels of DDR4-2133, the peak theoretical DRAM bandwidth is 34.1 GB/s.

4.2 Overview of Tests

To evaluate HIPLZ we collected a repository of HPC-relevant benchmarks, mini-apps, frameworks, and applications hosted on GitHub [17]. The 50 selected codes include 2 benchmarks, 7 mini-apps (2 for BerkeleyGW), 1 application, and 40 HIP examples. The codes are listed in Table 4.

4.3 Results

We first discuss the performance results of the benchmarks and then the overall build/run/pass rate for the tests. For the measurements presented here we used:

  • HIPLZ version: From HIPLZ, branch launch_bounds, commit cbf2260

  • HIPCL version: From a fork of HIPCL, https://github.com/Kerilk/hipcl, in branch fence, commit dd39656

  • OpenCL version: Intel OpenCL 3.0 NEO, driver version 22.02.0

  • hip-test-suite [17], commit 3b19290

We note that HIPLZ has a differently named compiler driver than AMD HIP. HIPLZ uses clang++, while HIP uses hipcc.

Benchmark and Performance Results. To evaluate the performance of the HIPLZ implementation, we consider the tests in the hip-test-suite benchmarks subdirectory. The two tests in this subdirectory (ERT and BabelStream) measure the memory bandwidth and/or the peak performance of the system. The results are summarized in Table 2. Comparing memory bandwidth and floating-point performance, HIPLZ performs similarly to the OpenCL port and operates near the theoretical peaks of the Gen9 device.

For the memory bandwidth measurements, we expect the code to reach approximately 80% of the theoretical memory bandwidth of the hardware. As shown in Table 2, with the HIPLZ implementation, the HIP BabelStream port measures a bandwidth of 27.76 GB/s and the HIP ERT port measures 25.84 GB/s. These are both near 80% of the theoretical bandwidth of the employed hardware.

For the floating point performance measurements, with our HIPLZ implementation, ERT measured 303.22 Gflop/s double precision peak performance, and 1240.69 Gflop/s single precision peak performance. The measured double precision peak performance is about 91% of the theoretical value, and the measured single precision peak performance is about 94% of the theoretical value.

Table 2. Efficiency evaluation of HIPLZ with Comparable APIs.

Details about how this test was compiled and run can be found in Ref. [17].

Several of the tests in the proxies and HIP-Examples subfolders also have HIP and OpenCL ports and measure performance metrics. We also compare several of these performance metrics in Table 3. As shown in Table 3, the performance achieved by HIPLZ on Intel Gen9 GPUs is similar to that achieved by the OpenCL port for additional tests.

Table 3. Performance metrics from additional tests

We also note that although add4 and cuda-stream do not have OpenCL ports in the test suite, they measure memory bandwidth. The bandwidths they report are similar to those reported by the OpenCL and HIP ports of BabelStream in Table 2, so we consider them to achieve the expected performance.

Overall Results. The results are shown in Table 4. Out of 50 tests, 45/50 (90%) compile without errors, 41/50 (82%) compile and run without crashing, and 38/50 (76%) compile, run to completion, and give the correct answer.

Table 4. Detailed results of building, running, and checking correctness for the tests

Discussion of Results. We now discuss the reasons for the failures shown in Table 4. The tests that did not build failed due to dependence on external libraries not currently supported by HIPLZ (cholla depends on hipfft; KokkosDslash depends on kokkos), unimplemented functions (adept-proxy uses three-argument shuffles), and compiler errors (BerkeleyGW-FF, GridMini).

Four tests failed at runtime. In three of them (rodinia-backprop, rodinia-dwt2d, and rodinia-b+tree), kernel creation failed; these failures are being investigated. The fourth, gpu-burn, fails because it allocates more memory than is available on the hardware.

Three of the tests ran but did not give correct answers. BerkeleyGW-GPP, rodinia-heartwall, and RSBench all return output arrays that do not pass validation. Future work will investigate these issues.

5 Related Work

Many of the programming systems that support GPU offloading translate high-level programming language constructs to heterogeneous programming model APIs. Typical examples are OpenMP [18] and OpenACC [19]. Compilers that support OpenMP or OpenACC translate high-level pragma-based abstractions to lower-level calls (for example, CUDA driver or OpenCL calls). This allows code using OpenMP or OpenACC to target a wide variety of hardware, as long as the compiler lowers the abstractions into lower-level representations that the underlying runtime can ingest. This representation is bundled in a fat-binary-based executable, in which the same binary embeds both host and device code. The device code can then be recompiled or re-optimized when the driver is updated, without rebuilding the application. The use of fat binaries also simplifies application deployment: there is no need to maintain separate host and device binaries or sources and link them together for execution. LLVM/Clang [20] uses PTX as the intermediate language (IL) for the CUDA driver. The Intel OpenMP compiler makes a different choice and uses SPIR-V as the IL in order to target its OpenCL- or L0-based GPU backends [21, 22]. The approach in HIPLZ is similar, although we implement the HIP API rather than a pragma-based approach, and we use SPIR-V as the intermediate representation.

Different approaches exist to bridge programming models to L0: for example, ZLUDA [23] is a demonstrator that runs unmodified CUDA applications on top of L0 by implementing the CUDA driver API over L0 and converting NVIDIA PTX [24] to SPIR-V. ZLUDA only supports a limited subset of applications, but it showcases promising performance on those it does support.

Another well-known project bridging several programming models to OpenCL is pocl [25]. pocl implements OpenCL for NVIDIA GPUs on top of CUDA and for AMD GPUs on top of HSA, and it supports CPU devices as well through the POSIX threads programming API.

hipSYCL [26] is a SYCL implementation that leverages existing heterogeneous programming models such as CUDA and HIP to support different GPU architectures. It also provides work-in-progress support for Intel GPUs via oneAPI [21].

6 Conclusion

In this paper, we introduced the design and implementation of HIPLZ, a compilation and runtime system that allows HIP code to run on Intel GPUs. It uses the L0 API to implement the HIP API’s functionalities and SPIR-V as the IL to represent the kernel functions. To the best of our knowledge, HIPLZ is the first compiler and runtime system that allows HIP code to run on Intel GPUs by using L0.

HIPLZ successfully compiled and produced correct results on an Intel Gen9 GPU for more than 35 HIP test cases and mini-apps. In terms of performance, we ran two performance benchmarks using HIPLZ and were able to achieve approximately the same peak values as OpenCL, demonstrating that HIPLZ produces code that can effectively use the Intel GPU hardware. Future work will focus on extending performance for more applications and interoperability with other programming models like DPC++ and OpenMP.