
1 Introduction

This paper describes performance portability on different heterogeneous systems using the IRIS programming model for LU (lower-upper) matrix factorization. IRIS is a task + dependency-based programming model in which each task can encapsulate almost any kind of current parallel code (e.g., OpenMP, CUDA, HIP, OpenACC) and target almost any current parallel computer architecture (e.g., CPUs, graphics processing units [GPUs], digital signal processors [DSPs], field-programmable gate arrays [FPGAs]).

We use LU factorization as a motivating case study given its importance in multiple high-performance computing (HPC) applications [1,2,3], but the ideas explored in this paper can also be effectively applied to other HPC applications. LU factorization is also one of the most important benchmarks [4, 5] used to evaluate the performance of HPC systems. Parallel LU factorization is composed of four major and completely different operations that must be computed on blocks of different shapes and sizes, and these block sizes change over the course of the computation. All these factors make LU factorization a challenging case study for performance portability on heterogeneous systems, for which the best target architecture of each application component is unclear.

To make performance portability on heterogeneous systems simpler, we separate the algorithm design from the tuning. While the algorithm is described by using tasks + dependencies on top of IRIS, the tuning consists of choosing the target/code for each of the tasks, which enables us to use one code for multiple platforms.

The rest of the paper is organized as follows: Sect. 2 presents the main characteristics of the IRIS programming model, Sect. 3 introduces the LU factorization case study, and Sect. 4 outlines the effort to implement a portable and heterogeneous LU code using IRIS. The performance study is described in Sect. 5. Finally, related work is summarized in Sect. 6, and future directions and conclusions are presented in Sect. 7.

Fig. 1. The IRIS architecture.

2 IRIS Programming System

As a programming system for extremely heterogeneous architectures, IRIS [6] enables application developers to write portable applications across diverse heterogeneous programming platforms, including CUDA, HIP, Level Zero, OpenCL, and OpenMP (Fig. 1). IRIS orchestrates multiple programming platforms into a single execution/programming environment by providing portable tasks and shared virtual device memory.

IRIS provides a task-based programming model in which a task is the unit of scheduling. A task runs on a single device but is portable across any compute device in the system. A task can contain zero or more commands of four types: (1) host-to-device memory copy, (2) device-to-host memory copy, (3) kernel launch, and (4) host command. A task can depend on other tasks, and it cannot start until all of its prerequisite tasks complete. Therefore, writing an IRIS application means building directed acyclic graphs of tasks. Each task carries a target device selection policy when it is submitted. This policy is specified by the programmer, and it can be a device number, a device type (e.g., CPU, GPU, FPGA, DSP), or a built-in policy provided by IRIS (e.g., greedy, random, locality-aware, profile).
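To make this concrete, the following is a minimal, hypothetical sketch of a two-task IRIS program (copy a buffer to a device, run a kernel on it, copy the result back). The function names, signatures, and policy/flag constants (e.g., iris_task_create, iris_task_h2d_full, iris_task_kernel, iris_task_submit, iris_rw) are written from memory and should be checked against the IRIS headers; treat this as an approximation of the API, not its exact form.

```c
/* Hypothetical two-task IRIS program: copy a buffer to a device, scale it
 * with a kernel, and copy the result back.  Function names, signatures,
 * and policy/flag constants are approximations of the IRIS C API and may
 * differ from the released headers. */
#include <iris/iris.h>
#include <stdlib.h>

int main(int argc, char** argv) {
  iris_init(&argc, &argv, 1);

  size_t n = 1024;
  double* host = (double*) calloc(n, sizeof(double));
  iris_mem mem;                                  /* shared virtual device memory */
  iris_mem_create(n * sizeof(double), &mem);

  iris_task t0;                                  /* task 0: host-to-device copy */
  iris_task_create(&t0);
  iris_task_h2d_full(t0, mem, host);

  iris_task t1;                                  /* task 1: kernel + device-to-host copy */
  iris_task_create(&t1);
  void* params[1]      = { mem };
  int   params_info[1] = { iris_rw };            /* kernel reads and writes the buffer */
  iris_task_kernel(t1, "scale", 1, NULL, &n, NULL, 1, params, params_info);
  iris_task_d2h_full(t1, mem, host);
  iris_task_depend(t1, 1, &t0);                  /* t1 cannot start before t0 completes */

  iris_task_submit(t0, iris_gpu, NULL, 0);       /* policy: any GPU, non-blocking submit */
  iris_task_submit(t1, iris_default, NULL, 1);   /* default policy, blocking submit */

  iris_finalize();
  free(host);
  return 0;
}
```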

To achieve application portability and flexible task scheduling with effective data orchestration, IRIS provides shared virtual device memory across multiple, disjoint physical device memories. IRIS automatically transfers data across devices to maintain memory consistency across tasks. Therefore, all compute devices can share memory objects in the shared virtual device memory and see the same content in those memory objects.

3 LU Factorization

Decomposing a matrix A into lower and upper triangular factors (i.e., the LU factorization) makes it easier to solve systems of linear equations:

$$\begin{aligned} Ax = LUx = B. \end{aligned}$$
(1)
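Once the factors L and U are available, the system in (1) is solved with two inexpensive triangular solves rather than a new factorization for every right-hand side:

$$\begin{aligned} Ly = B \quad \text{(forward substitution)}, \qquad Ux = y \quad \text{(backward substitution)}. \end{aligned}$$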

LU factorization plays a key role in many computational science applications. However, it is also computationally expensive, which motivated us to develop a new LU factorization implementation on top of the IRIS programming model to provide performance portability on different modern heterogeneous systems.

One of the most common ways to parallelize this type of operation is to decompose the matrix into tiles by defining the dependencies between the tiles and the operations to be computed on each tile. This can be accomplished through tasking [7,8,9].

The LU factorization on a tiled matrix (Fig. 2) consists of (1) factorizing the first tile of the diagonal to obtain the L (dark-green) and U (light-green) matrices of the tile; (2) computing several triangular solves (TRSMs, light-blue) by using the L matrix for the corresponding row and the U matrix for the corresponding column; and (3) computing the so-called update step (dark-blue) by multiplying (i.e., general matrix multiply [GEMM]) the results of the set of TRSMs and updating the tiles in the rest of the matrix. We then repeat these steps on the next tile of the diagonal until the entire matrix is factorized.
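For reference, the following is a simplified, sequential C sketch of this tile loop for an n-by-n column-major matrix split into square b-by-b tiles, written with standard CBLAS calls and a small hand-written non-pivoting tile factorization. It illustrates only the dependency pattern and is not the IRIS task code described in Sect. 4.

```c
/* Simplified, sequential illustration of the tiled right-looking LU
 * (no pivoting).  A is an n-by-n column-major matrix with leading
 * dimension n, decomposed into nt-by-nt square tiles of size b. */
#include <cblas.h>
#include <stddef.h>

/* Unblocked, non-pivoting LU of a b-by-b tile stored with leading dim ld. */
static void getrf_nopiv(int b, double* A, int ld) {
  for (int k = 0; k < b; ++k)
    for (int i = k + 1; i < b; ++i) {
      A[i + k * ld] /= A[k + k * ld];                   /* column of L          */
      for (int j = k + 1; j < b; ++j)
        A[i + j * ld] -= A[i + k * ld] * A[k + j * ld]; /* trailing tile update */
    }
}

void tiled_lu(int n, int b, double* A) {
  int nt = n / b;                                       /* assume b divides n */
  #define T(i, j) (A + (size_t)(i) * b + (size_t)(j) * b * n)
  for (int k = 0; k < nt; ++k) {
    getrf_nopiv(b, T(k, k), n);                         /* (1) factorize diagonal tile */
    for (int j = k + 1; j < nt; ++j)                    /* (2) TRSM on the tile row: solve L * U_kj = A_kj */
      cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                  CblasUnit, b, b, 1.0, T(k, k), n, T(k, j), n);
    for (int i = k + 1; i < nt; ++i)                    /* (2) TRSM on the tile column: solve L_ik * U = A_ik */
      cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                  CblasNonUnit, b, b, 1.0, T(k, k), n, T(i, k), n);
    for (int i = k + 1; i < nt; ++i)                    /* (3) GEMM update of the trailing matrix */
      for (int j = k + 1; j < nt; ++j)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, b, b, b,
                    -1.0, T(i, k), n, T(k, j), n, 1.0, T(i, j), n);
  }
  #undef T
}
```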

Fig. 2. LU decomposition.

Although the state-of-the-art routine for LU factorization involves pivoting, we developed a non-pivoting version for two reasons: (1) pivoting is not necessary for well-conditioned matrices, and (2) we want to analyze the proposed optimizations without the influence of pivoting on the performance analysis. Additionally, although using pivoting to solve linear systems of equations is commonly accepted, we found multiple problems in which the matrices were well conditioned, which makes expensive operations such as pivoting unnecessary. For this reason, multiple implementations in reference libraries do not use this technique. Examples include PLASMA [9], LASs [7], Intel's MKL, NVIDIA's cuSolver [10] and cuSparse [11], FISHPACK [12, 13], and SuperLU [14].

4 Implementation

Figure 3 shows the pseudocode for the IRIS-implemented LU factorization. At this algorithm level, we declare the memory spaces, the tasks, and the dependencies among them, and we describe the algorithm to be computed. As shown, every task can be computed on a CPU, a GPU, or both, depending on the optimizations and ideas we want to explore. These optimizations do not require code modifications at the algorithm level; they are conducted internally in each of the tasks at the implementation level. Although the algorithm is described/implemented at the algorithm level in an architecture-agnostic way, the implementation level (the set of tasks) attempts to obtain the maximum performance on the target architecture. One of the benefits of using IRIS is that one algorithm-level code can have multiple different implementation levels, each optimized for a specific heterogeneous platform.

Fig. 3. LU factorization code using IRIS.

The implementation of our LU factorization consists of four different tasks (Fig. 4): (1) GETRF, in which we compute a non-pivoting LU factorization on the top-left corner tile; (2) TRSM-top, in which we compute the level-3 BLAS TRSM routine by using the lower side of the LU factorization computed in the previous task as the input matrix (A in Fig. 4) and the rectangular tile located to the right of the LU matrix as the output matrix (B in Fig. 4); (3) TRSM-left, in which we compute the same level-3 BLAS operation used in the previous task but on a different part of the matrix by using the upper side of the LU matrix computed by the first task (GETRF) as input and a set of square tiles located under the lower side of the LU matrix as output (\(B_0, B_1, B_2\), and \(B_3\) in Fig. 4); and (4) GEMM, in which we compute a matrix-matrix multiplication by using the output of the two previous tasks as input and the remaining part of the matrix as output. We repeat all these tasks until the entire matrix is factorized (Figs. 2 and 4 [left]).

Fig. 4. Tasks of the LU decomposition implementation.

4.1 Memory Management

Our goal is to maximize the use of both the CPU and the GPU. In general, larger tile sizes achieve relatively higher performance during computation. Knowing this, instead of using square tiles (Fig. 2), we decided to use rectangular tiles when possible. The only exception is the TRSM-left task/operation (Fig. 4), in which the rectangular tile is divided into a set of square tiles. This decomposition, which is carried out internally in the TRSM-left task at the implementation level, is necessary because this particular operation requires the vertical dimensions of the A (input) and B (output) matrices to be the same.
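To illustrate that internal decomposition, one possible way to express the per-tile solves is sketched below: the m-by-b column panel is split into m/b square b-by-b tiles and one TRSM is issued per tile, solving \(X_i U = B_i\) with the b-by-b upper-triangular factor produced by GETRF. Column-major storage is assumed, b is assumed to divide m, and the paper's actual task may organize the calls differently.

```c
/* Sketch of the internal decomposition used by the TRSM-left task: the
 * m-by-b column panel below the diagonal tile is split into m/b square
 * b-by-b tiles, and one TRSM is issued per square tile (solving
 * X_i * U = B_i, where U is the upper-triangular factor from GETRF). */
#include <cblas.h>
#include <stddef.h>

void trsm_left_by_square_tiles(int m, int b, const double* U, int ldu,
                               double* B, int ldb) {
  for (int i = 0; i < m / b; ++i) {
    double* Bi = B + (size_t)i * b;          /* i-th square tile of the panel */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                CblasNonUnit, b, b, 1.0, U, ldu, Bi, ldb);
  }
}
```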

In our code, we use one pointer (A in Fig. 3) to allocate the matrix to be factorized. Additionally, we create four different memory spaces that correspond to the memory (tiles) computed by each of the tasks. This way, we define a two-level memory space: one used at the algorithm (IRIS) level (the A pointer) and one used at the implementation (task) level (A_iris, B_iris0/1, and C_iris).

4.2 Tasking

As we described above, the idea is to maximize the use of both the CPU and the GPU. To do that, apart from carrying out the matrix decomposition illustrated in Fig. 4, we use multithreaded (CPU and GPU) computations to exploit the parallelism at both the algorithm and the implementation levels.

In our code, essentially every task corresponds to one LAPACK or BLAS routine. Each task takes the same parameters described in the standard specification of these math libraries, so the tasks can be seen as wrappers around a standard linear algebra library. For convenience, we implemented a different C structure data type per LAPACK or BLAS routine, which is then used to pass the arguments from the algorithm level to the implementation level.
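As an illustration only (the field names in our code may differ), the argument structure for a TRSM task could mirror the standard BLAS dtrsm parameter list:

```c
/* Hypothetical argument structure for a TRSM task; the fields mirror the
 * standard (column-major) BLAS dtrsm parameter list and are filled in at
 * the algorithm level, then unpacked at the implementation level. */
typedef struct {
  char   side;    /* 'L' or 'R'                   */
  char   uplo;    /* 'U' or 'L'                   */
  char   transa;  /* 'N', 'T', or 'C'             */
  char   diag;    /* 'U' (unit) or 'N' (non-unit) */
  int    m, n;    /* dimensions of B              */
  double alpha;
  const double* A;
  int    lda;
  double* B;
  int    ldb;
} dtrsm_args_t;
```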

Although it is well known that LU factorization does not perform well for small matrices on GPUs, it is difficult to know which platform (i.e., CPU or GPU) is best suited for the rest of the tasks. This is particularly challenging because the workload and number of operations (i.e., the size of the tiles) in each of the tasks change over the course of the execution. Another important factor to consider is the differences among the components of our heterogeneous systems and the interconnects between them. Fortunately, in IRIS, one task can be run on either the CPU or the GPU; in other words, we can decide which architecture to use depending on the size of the tile or other factors. We implemented several approaches for each of the tasks using CPU-only, GPU-only, and CPU-GPU methods (Fig. 5).
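A hedged sketch of how such a per-task device choice might be encoded is shown below; the crossover threshold and the selection rule are purely illustrative and would in practice be calibrated from measurements such as those in Sect. 5.

```c
/* Illustrative heuristic only (not the paper's actual policy): choose the
 * execution target of a task from the current tile dimensions.  The
 * crossover point is hypothetical and platform dependent. */
enum target { TARGET_CPU, TARGET_GPU };

static enum target pick_target(int tile_rows, int tile_cols) {
  long work = (long) tile_rows * tile_cols;   /* rough proxy for the task workload   */
  return (work < 512L * 512L) ? TARGET_CPU    /* small tiles: launch overhead wins   */
                              : TARGET_GPU;   /* large tiles: GPU throughput wins    */
}
```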

Fig. 5. The IRIS algorithm-implementation partition.

Next, we explain the main characteristics of the different implementations of each task.

GETRF. The computation of the LU (no pivoting) factorization is carried out on the CPU. Although we do not perform any GPU computation in this task, we can overlap it with computationally expensive memory transfers between the CPU and GPU, such as transferring the B matrices used by the TRSM tasks (TRSM-top and TRSM-left in Fig. 4) while the LU factorization is being computed. Thus, we have two different implementations: (1) one in which we only compute the LU factorization on the CPU and (2) one in which we simultaneously compute the LU factorization, perform the CPU-to-GPU memory transfers for the B matrices used by the TRSM tasks, and perform the CPU-to-GPU transfer of the factorization output, which is also used by the TRSM tasks.
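A minimal sketch of the second GETRF variant, assuming CUDA streams, pinned host buffers, contiguously packed tiles, and the getrf_nopiv helper from the earlier tile-loop sketch (i.e., not the paper's exact code): the H2D copies are enqueued asynchronously and the CPU factorization runs while they are in flight.

```c
/* Sketch of overlapping the CPU factorization of the diagonal tile with
 * the CPU-to-GPU transfers needed by the TRSM tasks. */
#include <cuda_runtime.h>

void getrf_with_overlap(int b, double* Akk_host, double* Akk_dev,
                        const double* Btop_host, double* Btop_dev, size_t btop_bytes,
                        const double* Bleft_host, double* Bleft_dev, size_t bleft_bytes,
                        cudaStream_t stream) {
  /* Enqueue the transfers of the TRSM input panels first ... */
  cudaMemcpyAsync(Btop_dev,  Btop_host,  btop_bytes,  cudaMemcpyHostToDevice, stream);
  cudaMemcpyAsync(Bleft_dev, Bleft_host, bleft_bytes, cudaMemcpyHostToDevice, stream);

  /* ... and factorize the b-by-b diagonal tile on the CPU while they are in flight. */
  getrf_nopiv(b, Akk_host, b);          /* tile stored contiguously, ld = b */

  /* The factorization output is also consumed by the TRSM tasks on the GPU. */
  cudaMemcpyAsync(Akk_dev, Akk_host, (size_t)b * b * sizeof(double),
                  cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
}
```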

TRSM-Top. Three different variants of TRSM-top were implemented: (1) a CPU version in which we use the TRSM routine from the CPU vendor libraries (i.e., IBM ESSL on Summit and Intel MKL on Oswald), (2) a GPU version in which we perform both the cuBLAS TRSM call and the CPU-GPU memory copies for the input (from CPU to GPU) and output (from GPU to CPU), and (3) an optimization of the GPU version in which we perform the cuBLAS TRSM call and a GPU-to-CPU memory copy to transfer the result of the cuBLAS routine back to the CPU. In the last implementation, we do not carry out the CPU-to-GPU communication because it is performed in the GETRF task.
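For concreteness, a sketch of the third (GPU-optimized) variant is shown below, assuming cuBLAS, column-major packed tiles, and that the input panels were already placed in GPU memory by the GETRF task, so only the result travels back to the host.

```c
/* Sketch of the GPU-optimized TRSM-top variant: the lower-triangular
 * factor L (b-by-b) and the right-hand-side panel B (b-by-n) are assumed
 * to be on the GPU already, so the task only runs the cuBLAS TRSM and
 * copies the result back to the host. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void trsm_top_gpu_opt(cublasHandle_t handle, int b, int n,
                      const double* L_dev, double* B_dev, double* B_host) {
  const double one = 1.0;
  /* Solve L * X = B on the GPU; X overwrites B_dev (column-major, ld = b). */
  cublasDtrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
              CUBLAS_OP_N, CUBLAS_DIAG_UNIT, b, n, &one, L_dev, b, B_dev, b);
  /* Only the GPU-to-CPU copy of the result is needed in this variant. */
  cudaMemcpy(B_host, B_dev, (size_t)b * n * sizeof(double),
             cudaMemcpyDeviceToHost);
}
```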

TRSM-Left. Two different versions of TRSM-left were implemented: (1) a CPU implementation in which we compute the TRSM routine of the IBM ESSL library on Summit and of the Intel MKL library on Oswald and (2) a GPU code in which we perform the CPU-GPU memory copies necessary to transfer the input to GPU memory and the output back to CPU memory after computing the cuBLAS TRSM routine on the GPU.

GEMM. We implemented three different variants of GEMM: (1) a CPU code that uses the CPU vendor libraries; (2) a GPU code in which the C matrix is transferred from CPU to GPU before the GPU computation of GEMM (cuBLAS) and back from GPU to CPU afterward, the A matrix is transferred from CPU to GPU before the computation, and the B matrix (the output of TRSM-top) is already in GPU memory; and (3) an optimized GPU implementation in which only the A matrix is transferred from CPU to GPU before the computation of GEMM on the GPU, and the top-left tile of the C matrix is transferred from GPU to CPU after the computation. In the last implementation, we transfer the whole matrix to be factorized from CPU to GPU at the very beginning of the execution. Although this can be time consuming, it is only done once and has important implications for the overall performance (see Sect. 5).
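A sketch of the optimized GPU GEMM variant under the assumptions stated above (the trailing matrix and the TRSM-top output already reside on the GPU; only the TRSM-left output goes host-to-device, and only the next diagonal tile of C comes back for the next GETRF); buffer names and the copy granularity are illustrative.

```c
/* Sketch of the optimized GPU GEMM variant: only the A panel (TRSM-left
 * output, m-by-b) is sent host-to-device, and only the next b-by-b
 * diagonal tile of C is copied back after the update  C := C - A * B. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_update_gpu_opt(cublasHandle_t handle, int m, int n, int b,
                         const double* A_host, double* A_dev,
                         const double* B_dev,       /* b-by-n, already on GPU    */
                         double* C_dev, int ldc,    /* m-by-n trailing matrix    */
                         double* C00_host) {        /* next b-by-b diagonal tile */
  const double minus_one = -1.0, one = 1.0;
  cudaMemcpy(A_dev, A_host, (size_t)m * b * sizeof(double),
             cudaMemcpyHostToDevice);
  /* C := C - A * B on the GPU (column-major). */
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, b,
              &minus_one, A_dev, m, B_dev, b, &one, C_dev, ldc);
  /* Copy back only the b-by-b top-left tile of C, column by column. */
  for (int j = 0; j < b; ++j)
    cudaMemcpy(C00_host + (size_t)j * b, C_dev + (size_t)j * ldc,
               (size_t)b * sizeof(double), cudaMemcpyDeviceToHost);
}
```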

Table 1. Summit and Oswald hardware specifications.

5 Performance Analysis

This section describes the performance analysis of our code and the different variants/optimizations implemented. As a test case, we used a \(16,384\times 16,384\) matrix with a tile size of \(512\times 512\). We used two different heterogeneous systems for our analysis: Summit and Oswald (see Table 1 for the hardware features).

5.1 GETRF

For the GETRF task, we used Intel MKL’s LAPACKE_mkl_dgetrfnpi routine on Oswald, whereas we used our own code on Summit because the IBM ESSL library does not have a routine for the non-pivoting LU factorization. In terms of performance, the optimized vendor library (i.e., Intel MKL on Oswald) achieved much better performance (about 48 GFLOP/s) compared to our own implementation on Summit (15 GFLOP/s). However, as we describe below, this did not have a significant impact on the overall performance.

When overlapping communication with computation (i.e., transferring the B matrices used by the TRSM tasks and the output of the factorization from CPU to GPU), we see a drop in performance compared to the CPU-only implementation, achieving an overall performance of 11 GFLOP/s on Summit and 42 GFLOP/s on Oswald.

5.2 TRSM-top

Fig. 6. TRSM-top performance.

Figure 6 illustrates the performance reached by the TRSM-top task on Summit and Oswald. As expected, the GPU-optimized implementation reaches the highest performance on both systems, at least in the first steps. On Summit, the performance of this implementation is considerably higher than that of the other two implementations, about \(2\times \) higher in some cases. In fact, it is the fastest implementation for all steps on Summit. On Oswald, the performance of the GPU-optimized implementation is higher than that of the CPU implementation (the second-fastest implementation) in the first steps; however, the CPU implementation is the faster of the two in the last steps, in which the computational cost and parallelism of computing TRSM are much lower. Although Summit has a much faster CPU-GPU connection (NVLink), the larger number of CPU cores on Oswald and the more similar performance (GFLOP/s) between its CPU and GPU make the CPU implementation faster than the GPU-optimized implementation in the last steps.

As shown, although the GPU-optimized implementation is the best choice for Summit, Oswald benefits from a heterogeneous approach (task) in which the GPU-optimized implementation is used during the first steps of the algorithm, and the CPU implementation is computed in the last steps.

5.3 TRSM-Left

Although this task uses the same level-3 BLAS operation as the previous one, we achieve very different performance results owing to the different matrix decomposition required by this operation. Again, as shown in Fig. 7, the performance varies significantly depending on the target platform. While the performance of the GPU implementation is lower than that of the CPU implementation on Oswald, we see the opposite scenario on Summit, where the GPU implementation is considerably faster than the CPU implementation.

As was the case for the TRSM-top task, we also need a different configuration here depending on the target platform.

Fig. 7. TRSM-left performance.

5.4 Join TRSM-Top and TRSM-Left

For Oswald, we can see that although the GPU implementation is better for TRSM-top, the CPU implementation is the better choice for TRSM-left. Because both tasks are completely independent, this opens an opportunity for better performance on Oswald by computing both tasks in parallel using the more suitable implementation for each task. As shown in Fig. 8, joining both tasks enables a significant speedup.
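A sketch of that joined task on Oswald, assuming cuBLAS for the GPU part and a host BLAS for the CPU part, and reusing the trsm_left_by_square_tiles helper from the earlier sketch: because cuBLAS calls return to the host as soon as the work is queued, the CPU TRSM runs while the GPU solve is in flight.

```c
/* Sketch of the joined TRSM task: launch the TRSM-top solve on the GPU
 * and compute the TRSM-left solves on the CPU in the meantime. */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cblas.h>

void join_trsm(cublasHandle_t handle, int b, int n_top, int m_left,
               const double* L_dev, double* Btop_dev,       /* GPU: TRSM-top  */
               const double* U_host, double* Bleft_host) {  /* CPU: TRSM-left */
  const double one = 1.0;
  /* GPU part: enqueue the solve and return immediately. */
  cublasDtrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
              CUBLAS_OP_N, CUBLAS_DIAG_UNIT, b, n_top, &one, L_dev, b, Btop_dev, b);
  /* CPU part: runs while the GPU solve is in flight. */
  trsm_left_by_square_tiles(m_left, b, U_host, b, Bleft_host, m_left);
  /* Wait for the GPU solve before dependent tasks (GEMM) use its output. */
  cudaDeviceSynchronize();
}
```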

Fig. 8. Left: Time (s) of TRSM-top + TRSM-left and join-TRSM. Right: Join-TRSM speedup.

5.5 GEMM

Unlike the other two tasks, for GEMM we see the same behavior on both heterogeneous systems (Fig. 9). The GPU-optimized implementation proved to be the fastest implementation on both systems and is 6\(\times \)–8\(\times \) faster than the second-fastest approach. As expected, the performance decreases along the execution; this can be seen in the other tasks too, in which the computational cost and the parallelism are much lower in the last steps than in the first steps of the algorithm.

Fig. 9. GEMM performance.

5.6 Overall Performance

Here, we evaluate the overall performance of the different task implementations, starting with the CPU-only implementations. On Oswald, we achieved an overall performance of 340 GFLOP/s, whereas on Summit we achieved 120 GFLOP/s. The relatively poor performance on Summit stems from our own implementation of the GETRF task.

Next, we evaluate using both the CPU and the GPU. For this, GETRF is computed on the CPU, and the rest of the tasks are computed on the GPU. We do not overlap computation with memory transfers at this level. On Oswald, we achieved 225 GFLOP/s, whereas on Summit we achieved 511 GFLOP/s. Here, Summit's faster CPU-GPU connection results in better performance. The lower CPU-GPU bandwidth on Oswald has a negative impact, and the overall performance is lower than when using only the CPU.

Moving on, we focus on the overall performance impact of the different optimizations implemented in the tasks. We start with Oswald. By overlapping the LU factorization with CPU-GPU communication in the GETRF task and by using the GPU to compute the TRSM-top task, we increased the performance to 235 GFLOP/s. By also running TRSM-left on the CPU instead of the GPU, we achieved 275 GFLOP/s. As shown in Fig. 9, the most important optimization consists of moving the whole matrix from CPU to GPU at the very beginning. Using the GPU-optimized implementation of the GEMM task, we increased the overall performance to 652 GFLOP/s. Finally, we applied the last optimization, joining TRSM-left (computed on the CPU) and TRSM-top (computed on the GPU), which increased the overall performance to 700 GFLOP/s.

Next, we focus on Summit. By overlapping the LU factorization with the CPU-GPU communication in the GETRF task, we increased the overall performance to 546 GFLOP/s. As shown in Fig. 7, the CPU is not faster than the GPU for TRSM-left on Summit. Consequently, the TRSM optimizations applied on Oswald are not beneficial on Summit; computing both TRSM tasks on the GPU is better. Finally, using the GPU-optimized implementation of the GEMM task increases the overall performance considerably, achieving 1,972 GFLOP/s.

6 Related Work

Recently, we have seen important progress toward performance portability. Some examples are the C++ template metaprogramming libraries Kokkos [15] and RAJA [16]. These libraries can build different binaries that target different architectures from one source code. However, they cannot use more than a single architecture at a time.

Using CPUs and GPUs for HPC codes has been widely studied [13, 17, 18]. Since OpenMP 4.0, it has been possible to use GPU offloading in OpenMP codes. Valero-Lara et al. [19] used OpenMP 4.5 to implement a heterogeneous version of the TRSM level-3 BLAS routine and achieved good performance on one node of Oak Ridge National Laboratory's Summit supercomputer. One important reference for heterogeneous linear algebra codes is the MAGMA [20] library. MAGMA offers multiple heterogeneous implementations of several LAPACK routines. Unfortunately, it does not provide an implementation for our test case.

In contrast, our work focuses on the potential benefits of using IRIS for performance portability on heterogeneous HPC architectures. To the best of our knowledge, this is the first time that a portable and heterogeneous LU factorization code (implemented with IRIS) has been developed and analyzed.

7 Final Remarks and Future Directions

The diversity of current and upcoming heterogeneous systems hinders the implementation of HPC codes. By using IRIS, we not only make this effort more affordable but also implement portable and heterogeneous HPC codes by separating the algorithm design from the implementation. First, we described our algorithm using tasks + dependencies. After that, we decided which code to use in each of the tasks. A specific and different setting must be used depending on the target platform. In this paper, we were able to optimize one of the most important HPC algorithms, LU factorization, on two different heterogeneous platforms with minimal modifications to the code.

However, a more thorough and computationally expensive study is required to evaluate which code/implementation should be used in each of the tasks. In the future, we plan to implement alternatives that determine this setting automatically and at a lower computational cost. We also want to extend this effort to other HPC applications and heterogeneous systems.