1 Introduction

Modern High Performance Computing (HPC) has entered an era of extreme heterogeneity in which an increasing number of accelerators support SIMD parallelism, spatial computing, or domain-specific architectures. This is especially true as we move toward exascale, where the majority of pre-exascale and exascale systems are accelerator-based. For example, 7 of the top 10 systems in the Top500 list for November 2021 are GPU-based [1]. Until recently, NVIDIA GPUs were the dominant accelerators targeted by applications, but several next-generation systems will be based on accelerators from other vendors: Aurora and SuperMUC-NG Phase II with Intel GPUs [2, 3], and Frontier, El Capitan, and LUMI with AMD GPUs [4, 5, 6]. Each vendor generally develops its own programming model and implementation optimized for its hardware. This design poses a challenge for application developers who wish to create portable code for multiple systems. This programming-model heterogeneity often forces application developers to maintain separate code branches in each vendor-specific programming model, which increases code complexity and developer effort.

Heterogeneous-compute Interface for Portability (HIP) from AMD is one example of such a programming system; it targets AMD and NVIDIA architectures. In this paper, we introduce HIPLZ: a compilation and runtime system that supports HIP on top of Intel’s Level Zero (L0) runtime [7], using the fat binary model to support multiple architectures and SPIR-V as an intermediate language (IL). To the best of our knowledge, HIPLZ is the first effort that bridges HIP to L0, which is the primary low-level application programming interface (API) for Intel hardware.

In this paper, we present the following contributions:

  1. The prototype of HIPLZ, a library that allows applications using the HIP API to run on devices that support Intel Level Zero and OpenCL drivers. The source code is located at https://github.com/jz10/anl-gt-gpu.

  2. A test suite that covers the major functionality of HIP and serves as the validation suite for HIPLZ.

  3. An evaluation of test coverage and code performance of HIPLZ on Intel Gen9 GPUs. Our results show that HIPLZ supports the complete execution of 82% of tested applications and demonstrates performance parity with HIPCL and OpenCL for memory- and FLOP-focused benchmarks.

The paper is organized as follows: Sect. 2 gives background information about the HIP programming model, the intermediate representation, and the Intel L0 runtime. The details of the design and implementation are presented in Sect. 3. Section 4 discusses the testing of HIPLZ and evaluates its performance. Section 5 discusses related work, and Sect. 6 concludes.

2 Background

2.1 Heterogeneous-compute Interface for Portability (HIP)

HIP [8] is a C++14 runtime API and kernel language derived from CUDA [9] that allows developers to create portable applications for AMD and NVIDIA GPUs from a single source code. It supports advanced C++ language features, including templates, C++11 lambdas, and much more.
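As a minimal illustration of the model (our own example, not taken from the HIP documentation; error checking omitted), a HIP vector addition looks like the following:

#include <hip/hip_runtime.h>
#include <vector>

// A __global__ kernel is callable from the host and executes on the device.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

int main() {
  const int n = 1 << 20;
  std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);
  float *da, *db, *dc;
  hipMalloc(&da, n * sizeof(float));
  hipMalloc(&db, n * sizeof(float));
  hipMalloc(&dc, n * sizeof(float));
  hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
  hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

  // Launch on the default stream; hipLaunchKernelGGL is equivalent to the
  // CUDA-style triple-chevron launch syntax.
  hipLaunchKernelGGL(vector_add, dim3((n + 255) / 256), dim3(256), 0, 0,
                     da, db, dc, n);

  hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
  hipFree(da); hipFree(db); hipFree(dc);
  return 0;
}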

2.2 Standard Portable Intermediate Representation (SPIR-V) and Fat Binary

SPIR-V [10] is an open industry-standard intermediate language (IL) for shader and kernel compilers, used for expressing parallel computation and GPU-based graphics. SPIR-V gives language front-end compilers a common target, allowing developers to build compute kernels without directly exposing source code. This IL allows shipping compiled kernels in a binary format that remains portable across multiple hardware implementations.

The fat binary model integrates device code (kernel functions) into the host-side executable binary via an intermediate language, and uses vendor APIs (the driver compiler) to just-in-time compile the kernel functions at runtime. SPIR-V and NVIDIA PTX are typical intermediate languages used in fat binaries.

2.3 OpenCL and HIPCL

OpenCL [11] is a widely used, open standard for programming heterogeneous platforms, and is supported by most of the major accelerator vendors, including NVIDIA, AMD, Xilinx, ARM, and Intel.

HIPCL [12] is an open-source compilation and runtime system that allows running HIP programs on OpenCL platforms with sufficient capabilities. HIPCL relies on SPIR-V as a target IL (i.e., a fat binary embedded in the ELF binary) and implements the HIP API on top of OpenCL calls.

2.4 Level Zero Runtime

Intel Level Zero (L0) [7] is a specification that is part of Intel oneAPI, a SYCL-based specification and set of APIs and tools targeting CPU, GPU, and FPGA devices. The Intel L0 implementation provides direct-to-metal access to accelerator devices and brings flexibility through the support of a broad set of language features, e.g., unified shared memory, synchronization primitives, and device function pointers. The aim of the L0 API is to provide a system-level programming interface that allows higher-level runtime APIs and libraries to easily target heterogeneous hardware; this is why we selected it for HIPLZ. The features of the L0 API include, but are not limited to: device partitioning, instrumentation, debugging, power management, frequency control, and hardware diagnostics. The L0 specification does not define a kernel language, but relies on SPIR-V as an IL.
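For orientation, a minimal host-side sketch of L0 driver, device, and context discovery (our own example, not HIPLZ code; error handling omitted) looks like this:

#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

int main() {
  // Initialize the loader and discover the available drivers.
  zeInit(0);

  uint32_t driverCount = 0;
  zeDriverGet(&driverCount, nullptr);
  std::vector<ze_driver_handle_t> drivers(driverCount);
  zeDriverGet(&driverCount, drivers.data());

  // Enumerate the devices exposed by the first driver.
  uint32_t deviceCount = 0;
  zeDeviceGet(drivers[0], &deviceCount, nullptr);
  std::vector<ze_device_handle_t> devices(deviceCount);
  zeDeviceGet(drivers[0], &deviceCount, devices.data());

  // A context manages memory, modules, events, command lists, and queues.
  ze_context_desc_t ctxDesc = {ZE_STRUCTURE_TYPE_CONTEXT_DESC, nullptr, 0};
  ze_context_handle_t context = nullptr;
  zeContextCreate(drivers[0], &ctxDesc, &context);

  // ... create command lists, modules, and kernels against this context ...

  zeContextDestroy(context);
  return 0;
}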

3 Design and Implementation

3.1 Design Goal

The main design goal of HIPLZ is to connect the Intel L0 runtime to the HIP programming model, thus enabling applications written in HIP to run on GPU devices driven by L0. Based on a survey of HPC application needs, we focused on supporting the following HIP features in HIPLZ: i) streams, including command execution and callbacks (Sect. 3.4); ii) memory management, including host, device, shared, and texture memory (Sect. 3.5); iii) kernel and module management (Sect. 3.6); iv) device management (Sect. 3.7); and v) interoperation with other parallel programming systems such as Intel’s DPC++ (Sect. 3.8). We implemented 133 of the 144 HIP functions that existed at the time HIPLZ was written. HIP now has 343 functions; the functions not covered by HIPLZ are mainly for graph operations.

Fig. 1. The compilation workflow for HIPLZ.

Fig. 2. The organization of the Intel Level Zero runtime.

3.2 The Compilation System

The workflow for the compilation of a HIP program by HIPLZ is shown in Fig. 1. The HIPLZ compilation workflow is based on that of HIPCL, which is a HIP-compatible compiler frontend based on the LLVM/Clang compiler. The HIPLZ compiler translates HIP source code to two parts of LLVM intermediate representation: host IR and device IR. The host part is processed via the legacy LLVM x86 backend to produce an x86 binary, and the device part is processed via the LLVM SPIR-V backend to produce SPIR-V IR. The x86 binary and the SPIR-V IR are then linked together to make an x86 executable binary (or shared library) that is embedded with SPIR-V (a fat binary).

3.3 Runtime System

Before getting into the details of HIP feature support, we introduce the basic structure of L0. Figure 2 presents the organization of L0 APIs and objects in a top-down manner. At the top level, each Driver interacts with a collection of heterogeneous computing devices that share a given software stack. A physical device is presented as a Device associated with a Context that provides an interface for managing memory, modules, synchronization objects, command lists, and command queues. L0’s memory management covers host memory, device memory, shared memory, and image samplers.

The L0 API is very similar to OpenCL’s, especially in terms of the device data abstraction, execution model, and event-driven synchronization. However, L0 is at a lower level, and many features that are available in OpenCL are left to the application developer to implement. Such features include (but are not limited to) reference counting to handle object lifetimes, callbacks on event state changes, and host kernel enqueuing. HIPLZ wraps L0 data structures in C++ classes in an object-oriented manner, similar to OpenCL’s C++ bindings.

Table 1 details the mapping of data structures and similar objects for the programming models used in the next sections: HIP, HIPLZ, L0, OpenCL, and SYCL. The HIPLZ compiler translates each HIP object to its corresponding HIPLZ data structure, as shown in the first two columns of Table 1.

Table 1. The mapping among HIP, HIPLZ, L0, OpenCL, and SYCL objects.

3.4 Streams

A stream in HIP is represented as a sequence of tasks (e.g., kernels, memory copies, events) that execute in FIFO order. Tasks executed in different streams are allowed to overlap and share device resources, and different streams may execute their commands out of order with respect to one another or concurrently. Three types of streams exist in HIP: the default stream (or NULL stream), blocking streams, and non-blocking streams. The last two types can be created by the application programmer and differ in how they synchronize with the default stream. Tasks in the default stream wait for all tasks previously submitted to blocking streams to complete before executing. Similarly, tasks in blocking streams wait for all tasks previously submitted to the default stream to complete before executing. Non-blocking streams do not synchronize with the default stream.
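A small sketch of the three kinds of streams (our own example; error checking omitted):

#include <hip/hip_runtime.h>

// A trivial placeholder kernel.
__global__ void work(float* p) { p[threadIdx.x] += 1.0f; }

void submit(float* a, float* b, float* c) {
  hipStream_t blocking, nonBlocking;
  // A stream created with default flags synchronizes with the NULL stream.
  hipStreamCreate(&blocking);
  // A non-blocking stream does not synchronize with the NULL stream.
  hipStreamCreateWithFlags(&nonBlocking, hipStreamNonBlocking);

  hipLaunchKernelGGL(work, dim3(1), dim3(64), 0, 0, a);           // default (NULL) stream
  hipLaunchKernelGGL(work, dim3(1), dim3(64), 0, blocking, b);    // waits on prior NULL-stream work
  hipLaunchKernelGGL(work, dim3(1), dim3(64), 0, nonBlocking, c); // may overlap with both

  hipStreamSynchronize(blocking);
  hipStreamSynchronize(nonBlocking);
  hipStreamDestroy(blocking);
  hipStreamDestroy(nonBlocking);
}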

Fig. 3. a. The basic HIPLZ Command List and Command Queue (image source: https://spec.oneapi.io/level-zero/latest/core/INTRO.html); b. The event order for executing a callback.

To implement HIP streams, L0 offers two possible modes of execution for dispatching tasks to a device: 1) a command buffer abstraction (named a command list) that aggregates a series of tasks and can later be submitted to a command queue, where the driver is free to optimize the execution of the command lists based on the synchronization expressed by the programmer; and 2) a low-latency dispatch (named an immediate command list) that executes tasks as soon as they are ready (dependencies met) and able to be executed (resources available).

In HIPLZ, streams are implemented via LZQueue objects that wrap L0’s immediate command lists (see Fig. 3(a)). This mode of execution is better suited to implementing the FIFO behavior of HIP streams. Synchronization considerations are still important to ensure barriers between tasks within streams, as well as to correctly implement the HIP default stream semantics and synchronization. Nonetheless, using immediate command lists greatly reduces the overhead of managing individual command lists that would otherwise need to be submitted to command queues and then freed or recycled once the tasks they contain finish executing. This technique eliminates the need for dedicated event tracking for each command list, irrespective of synchronization with other streams, and it also reduces the latency between task submission and execution.
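A simplified sketch of the idea (not HIPLZ's actual LZQueue implementation; kernel creation and error handling omitted):

#include <level_zero/ze_api.h>

// A simplified stand-in for HIPLZ's LZQueue: tasks appended to an immediate
// command list are dispatched as soon as their dependencies are satisfied,
// which matches the FIFO semantics of a HIP stream.
class StreamLike {
public:
  StreamLike(ze_context_handle_t ctx, ze_device_handle_t dev) {
    ze_command_queue_desc_t qdesc = {};
    qdesc.stype = ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC;
    qdesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
    zeCommandListCreateImmediate(ctx, dev, &qdesc, &cmdList_);
  }

  // Enqueue a kernel; it executes in FIFO order on this "stream".
  void launch(ze_kernel_handle_t kernel, ze_group_count_t groups) {
    zeCommandListAppendLaunchKernel(cmdList_, kernel, &groups,
                                    nullptr, 0, nullptr);
  }

  ~StreamLike() { zeCommandListDestroy(cmdList_); }

private:
  ze_command_list_handle_t cmdList_ = nullptr;
};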

The commands executed by the streams include: kernel functions, memory copy operations, host callbacks, and HIP event operations. The synchronization among different streams is supported via L0 events and their wait and signal APIs. The event object in a command list acts as either a barrier or signal, so two tasks running on different streams can use events to synchronize their executions.

Stream Synchronization Example: We use the host callback implementation as an example of how synchronization between and within streams is implemented in HIPLZ with L0. Figure 3(b) presents the workflow of host callback registration and invocation in HIPLZ. The callback function pointer is registered via the callback registration API, and a synchronization scheme based on L0 events is set up to drive the callback. This implementation of host callbacks requires a three-point synchronization scheme. For each callback, three L0 events are created, here called reg event, exec event, and final event. Three synchronization primitives are added to the L0 immediate command list: a barrier that signals the reg event once it is reached, a barrier that waits for the exec event to be signaled by the host, and lastly a signal on the final event indicating that the synchronization is complete and the events can be freed (or recycled). In parallel, the (host) event monitor thread waits for the reg event to be signaled, executes the callback, signals callback termination via the exec event, and waits on the final event before releasing the resources.
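The sketch below illustrates this scheme with the corresponding L0 calls (event-pool creation and error handling omitted; the function names are ours, not HIPLZ's):

#include <level_zero/ze_api.h>
#include <cstdint>

// Stream side: append the three synchronization points to the immediate
// command list of the stream that owns the callback.
void append_callback_sync(ze_command_list_handle_t cmdList,
                          ze_event_handle_t regEvent,
                          ze_event_handle_t execEvent,
                          ze_event_handle_t finalEvent) {
  // 1. Barrier that signals regEvent once all previously submitted tasks complete.
  zeCommandListAppendBarrier(cmdList, regEvent, 0, nullptr);
  // 2. Barrier that waits until the host signals execEvent.
  zeCommandListAppendBarrier(cmdList, nullptr, 1, &execEvent);
  // 3. Signal finalEvent so the three events can be freed or recycled.
  zeCommandListAppendSignalEvent(cmdList, finalEvent);
}

// Host side: executed by the event monitor thread.
void run_callback(void (*callback)(void*), void* userData,
                  ze_event_handle_t regEvent,
                  ze_event_handle_t execEvent,
                  ze_event_handle_t finalEvent) {
  zeEventHostSynchronize(regEvent, UINT64_MAX);    // wait until the stream reaches the barrier
  callback(userData);                              // run the registered host callback
  zeEventHostSignal(execEvent);                    // release the waiting stream
  zeEventHostSynchronize(finalEvent, UINT64_MAX);  // now safe to release the events
}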

3.5 Memory Management

HIPLZ supports several HIP memory management APIs, including hipMalloc, hipMemcpy, hipMemcpyAsync, and hipFree. Users can specify the allocation site, i.e., host memory, device memory, or shared memory. Shared memory depends on the underlying GPU’s support, and its reference is presented as a raw pointer that can be dereferenced on both the host and the device side. As mentioned in Sect. 3.4, in L0 a memory copy operation is implemented as a command that is queued on the command list and executed via the command queue.
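A small usage sketch of these APIs (our own example; error checking omitted):

#include <hip/hip_runtime.h>

void copy_example(hipStream_t stream, const float* hostSrc, size_t n) {
  // Device allocation plus an asynchronous host-to-device copy ordered on a stream.
  float* dev = nullptr;
  hipMalloc(&dev, n * sizeof(float));
  hipMemcpyAsync(dev, hostSrc, n * sizeof(float),
                 hipMemcpyHostToDevice, stream);

  // Shared (unified) memory: the same raw pointer is valid on host and device,
  // subject to the underlying GPU's support.
  float* shared = nullptr;
  hipMallocManaged(&shared, n * sizeof(float));
  shared[0] = 1.0f;  // host-side access through the shared pointer

  hipStreamSynchronize(stream);
  hipFree(dev);
  hipFree(shared);
}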

HIP texture objects are special memory objects, and their support is similar to that of texture objects in CUDA; that is, a texture object is a first-class C++ object and can be passed as an argument just as if it were a pointer. HIPLZ provides hipCreateTextureObject and hipDestroyTextureObject to allocate and free texture objects.

A texture object is composed of an image buffer and a sampler object that operates on the image buffer. Since the image and sampler are defined as separate objects in L0 (i.e. ze_image_t and ze_sampler_t), we create the texture object as a C struct, as shown in Listing 1.1.

The ze_image_t and ze_sampler_t created via the L0 API are raw pointer values, so they can be stored as intptr_t values. The actual texture operations are performed on reinterpreted structure fields, as shown in lines 6–9 of Listing 1.1, where a two-dimensional texture of floating-point values is sampled at coordinates x and y. This scheme relies on implementation-specific behavior of the Intel driver compiler.

Listing 1.1. The texture object structure and sampling code used by HIPLZ.
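A minimal host-side sketch of this idea, with hypothetical struct, field, and function names (the layout in Listing 1.1 may differ), is:

#include <level_zero/ze_api.h>
#include <cstdint>

// Hypothetical struct; the exact layout in Listing 1.1 may differ.
// The L0 image and sampler handles (ze_image_handle_t, ze_sampler_handle_t)
// are opaque pointers, so they can be carried as integer-sized fields and
// reinterpreted on the device side by the driver compiler.
struct TextureObject {
  intptr_t image;    // holds a ze_image_handle_t
  intptr_t sampler;  // holds a ze_sampler_handle_t
};

// Host side: package freshly created L0 handles into the struct that is
// later passed to the kernel as an ordinary argument.
inline TextureObject make_texture(ze_image_handle_t img,
                                  ze_sampler_handle_t smp) {
  TextureObject t;
  t.image = reinterpret_cast<intptr_t>(img);
  t.sampler = reinterpret_cast<intptr_t>(smp);
  return t;
}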

3.6 Kernel and Module Management

HIP defines three different attributes for functions: __host__, __device__, and __global__. A __host__-decorated function is executed on the host, and functions without decorators are considered host functions. A __device__ function is callable from the device, and this decorator can be combined with __host__ to obtain a function that can execute on both the device and the host. A __global__-decorated function, or kernel, is callable from the host. The HIPLZ compiler translates kernel and device functions to SPIR-V IL, which is translated to a device binary by the vendor compiler at runtime. Each kernel function is wrapped in an LZKernel object and managed by an LZProgram object that represents the L0 module. Kernel launches are based on the L0 API and issue a command to the immediate command list.

HIPLZ also supports device global variables that are used for exchanging values between kernels and host code. Device global variables are supported in SPIR-V, and they can be interacted with from the host using L0. They can also be supported in OpenCL using Intel extensions.
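A short sketch of the usual HIP usage pattern for such a variable (our own example; hipMemcpyToSymbol is the standard HIP API for this, though its exact coverage in HIPLZ is not detailed here):

#include <hip/hip_runtime.h>

// A device global variable shared between kernels and host code.
__device__ int threshold;

__global__ void filter(const int* in, int* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = (in[i] > threshold) ? in[i] : 0;
}

void set_threshold(int value) {
  // Copy the host value into the device global variable.
  hipMemcpyToSymbol(HIP_SYMBOL(threshold), &value, sizeof(int));
}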

3.7 Device Management

Device management in HIPLZ focuses on device selection (hipSetDevice and hipGetDevice) and device property queries (i.e., hipGetDeviceProperties). From the L0 standpoint, this means creating a single L0 context containing all the devices and exposing those devices through hipGetDeviceCount. This allows sharing memory between devices using USM without needing to register USM allocations across different contexts. Setting the current active device in HIPLZ changes the default device and default stream. HIP device properties are derived from the corresponding device properties available in L0.
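For example (our own sketch; error checking omitted):

#include <hip/hip_runtime.h>
#include <cstdio>

void pick_device(int preferred) {
  int count = 0;
  hipGetDeviceCount(&count);  // all devices in the single L0 context
  if (count == 0) return;

  int device = (preferred >= 0 && preferred < count) ? preferred : 0;
  hipSetDevice(device);       // becomes the default device (and default stream owner)

  hipDeviceProp_t props;
  hipGetDeviceProperties(&props, device);  // derived from the L0 device properties
  std::printf("Using %s with %zu bytes of global memory\n",
              props.name, (size_t)props.totalGlobalMem);
}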

3.8 SYCL Inter-operation

Interoperability between SYCL and HIP helps users maintain large heterogeneous code bases, and it also allows leveraging high-performance libraries built by vendors (e.g., Intel oneMKL [13]). Both HIPLZ and DPC++ use L0 as the runtime driver for executing kernel functions on Intel GPUs, and use L0’s driver object handles to maintain and exchange GPU device information, e.g., to pass an execution context object from HIPLZ to DPC++. To support data exchange, the unified shared memory (USM) mechanism is employed. Both HIPLZ and DPC++ use raw pointers to reference memory allocated from USM, which simplifies passing memory references between objects in each execution context.


3.9 Kernel Library

The implementation of the HIP math API in HIPLZ is based on OCML [14], which is a thin layer wrapping the OpenCL builtin math functions.

3.10 Discussion

Implementing HIP with L0 comes with some challenges:

Program interface: The L0 API organization is very similar to OpenCL’s, especially for the objects that abstract the GPU device. However, L0 is a lower-level API than OpenCL: it lacks a kernel language and object lifetime management, and it requires finer-grained control over tasks using queues and command lists. This requires careful management of object lifetimes in HIPLZ and more involved synchronization schemes than in HIPCL.

Capacity of Conversion: Users could benefit from a conversion guide that would describe the potential pitfalls that can arise from migrating to the L0 API from other heterogeneous programming models.

Lack of Thread Safety: Many runtime objects and APIs are not thread-safe according to the L0 specification, so mutual exclusion is enforced at all relevant API call sites in HIPLZ using mutexes.
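A simplified sketch of this pattern (not HIPLZ's actual code):

#include <level_zero/ze_api.h>
#include <mutex>

// The L0 specification does not guarantee thread safety for many objects, so
// every call site touching a shared command list is serialized with a mutex.
class GuardedList {
public:
  explicit GuardedList(ze_command_list_handle_t list) : list_(list) {}

  void signal(ze_event_handle_t event) {
    std::lock_guard<std::mutex> lock(mtx_);
    zeCommandListAppendSignalEvent(list_, event);
  }

private:
  std::mutex mtx_;
  ze_command_list_handle_t list_;
};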

4 Evaluation

4.1 Employed GPU System

In this study we evaluated HIPLZ on an Intel Gen9 GPU [15] in the JLSE cluster [16]. The Gen9 is an integrated GPU available in commercial Intel products such as laptops. Although Intel plans to release high-performance discrete GPUs [2], these were not publicly available at the time of writing, so we focus on the Gen9 GPU.

The Gen9 GPU has a peak theoretical double (single) precision performance of 331.2 GFlop/s (1324.8 GFlop/s). With 2 channels of DDR4-2133, the peak theoretical DRAM bandwidth is 34.1 GB/s.

4.2 Overview of Tests

To evaluate HIPLZ we collected a repository of HPC-relevant benchmarks, mini-apps, frameworks, and applications hosted on GitHub [17]. The 50 selected codes include 2 benchmarks, 7 mini-apps (2 for BerkeleyGW), 1 application, and 40 HIP examples. The codes are listed in Table 4.

4.3 Results

We first discuss the performance results of the benchmarks and then the overall build/run/pass rate for the tests. For the measurements presented here we used:

  • HIPLZ version: From HIPLZ, branch launch_bounds, commit cbf2260

  • HIPCL version: From a fork of HIPCL, https://github.com/Kerilk/hipcl, in branch fence, commit dd39656

  • OpenCL version: Intel OpenCL 3.0 NEO, driver version 22.02.0

  • hip-test-suite [17], commit 3b19290

We note that HIPLZ has a differently named compiler driver than AMD HIP. HIPLZ uses clang++, while HIP uses hipcc.

Benchmark and Performance Results. To evaluate the performance of the HIPLZ implementation, we consider the tests in the hip-test-suite benchmarks subdirectory. The two tests in this subdirectory (ERT and BabelStream) measure the memory bandwidth and/or the peak performance of the system. The results are summarized in Table 2. Comparing memory bandwidth and floating-point performance, HIPLZ performs similarly to the OpenCL port and operates near the theoretical peaks of the Gen9 device.

For the memory bandwidth measurements, we expect the code to reach approximately 80% of the theoretical memory bandwidth of the hardware. As shown in Table 2, with the HIPLZ implementation, the HIP BabelStream port measures a bandwidth of 27.76 GB/s and the HIP ERT port measures 25.84 GB/s. These are both near 80% of the theoretical bandwidth of the employed hardware.

For the floating point performance measurements, with our HIPLZ implementation, ERT measured 303.22 Gflop/s double precision peak performance, and 1240.69 Gflop/s single precision peak performance. The measured double precision peak performance is about 91% of the theoretical value, and the measured single precision peak performance is about 94% of the theoretical value.

Table 2. Efficiency evaluation of HIPLZ with Comparable APIs.

Details about how this test was compiled and run can be found in Ref. [17].

Several of the tests in the proxies and HIP-Examples subfolders also have HIP and OpenCL ports and measure performance metrics. We also compare several of these performance metrics in Table 3. As shown in Table 3, the performance achieved by HIPLZ on Intel Gen9 GPUs is similar to that achieved by the OpenCL port for additional tests.

Table 3. Performance metrics from additional tests

We also note that although add4 and cuda-stream do not have OpenCL ports in the test suite, they measure memory bandwidth. The bandwidths they report are similar to those reported by the OpenCL and HIP ports of BabelStream in Table 2, so we consider them to achieve the expected performance.

Overall Results. The results are shown in Table 4. Out of 50 tests, 45/50 (90%) compile without errors, 41/50 (82%) compile and run without crashing, and 38/50 (76%) compile, run to completion, and give the correct answer.

Table 4. Detailed results of building, running, and checking correctness for the tests

Discussion of Results. We now discuss the reasons for the failures shown in Table 4. The tests that did not build failed due to dependence on external libraries not currently supported by HIPLZ (cholla depends on hipfft; KokkosDslash depends on kokkos), unimplemented functions (adept-proxy uses three-argument shuffles), and compiler errors (BerkeleyGW-FF, GridMini).

Four tests failed at runtime. In three of them (rodinia-backprop, rodinia-dwt2d, and rodinia-b+tree), kernel creation failed; these failures are being investigated. The fourth, gpu-burn, fails because it allocates more memory than is available on the hardware.

Three of the tests ran but did not give correct answers. BerkeleyGW-GPP, rodinia-heartwall, and RSBench all return output arrays that do not pass validation. Future work will investigate these issues.

5 Related Work

Many of the programming systems that support GPU offloading translate high-level programming language constructs to heterogeneous programming model APIs. Typical examples are OpenMP [18] and OpenACC [19]. Compilers that support OpenMP or OpenACC translate high-level pragma-based abstractions to lower-level calls (for example, CUDA driver or OpenCL calls). This allows code using OpenMP or OpenACC to target a wide variety of hardware, as long as the compiler lowers the abstractions into lower-level representations that the underlying runtime can ingest. This representation is bundled in a fat-binary-based executable, in which the same binary embeds both host and device code. The device code can then be recompiled or re-optimized when the driver is updated, without rebuilding the application. The use of fat binaries also simplifies application deployment: there is no need to maintain separate host and device binaries or sources and link them together for execution. LLVM/Clang [20] uses PTX as the intermediate language (IL) for the CUDA driver. The Intel OpenMP compiler makes a different choice and uses SPIR-V as the IL in order to target its OpenCL- or L0-based GPU backends [21, 22]. The approach in HIPLZ is similar, although we implement the HIP API rather than a pragma-based approach, and we use SPIR-V as the intermediate representation.

Different approaches exist to bridge programming models to L0: for example, ZLUDA [23] is a demonstrator that runs unmodified CUDA applications on top of L0 by implementing the CUDA driver API over L0 and converting NVIDIA PTX [24] to SPIR-V. ZLUDA only supports a limited subset of applications, but it showcases promising performance on those it does support.

Another well-known project bridging several programming models to OpenCL is pocl [25]. pocl implements OpenCL for NVIDIA GPUs on top of CUDA and for AMD GPUs on top of HSA, and it supports CPU devices as well through the POSIX threads programming API.

hipSYCL [26] is a SYCL implementation that leverages existing heterogeneous programming models such as CUDA and HIP to support different GPU architectures. It also provides work-in-progress support for Intel GPUs via oneAPI [21].

6 Conclusion

In this paper, we introduced the design and implementation of HIPLZ, a compilation and runtime system that allows HIP code to run on Intel GPUs. It uses the L0 API to implement the HIP API’s functionalities and SPIR-V as the IL to represent the kernel functions. To the best of our knowledge, HIPLZ is the first compiler and runtime system that allows HIP code to run on Intel GPUs by using L0.

HIPLZ successfully compiled and produced correct results on an Intel Gen9 GPU for more than 35 HIP test cases and mini-apps. In terms of performance, we ran two performance benchmarks using HIPLZ and were able to achieve approximately the same peak values as OpenCL, demonstrating that HIPLZ produces code that can effectively use the Intel GPU hardware. Future work will focus on extending performance for more applications and interoperability with other programming models like DPC++ and OpenMP.