Tile Low-Rank GEMM Using Batched Operations on GPUs

Charara, Ali; Keyes, David; Ltaief, Hatem

doi:10.1007/978-3-319-96983-1_57

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11014))

Included in the following conference series:

European Conference on Parallel Processing

2787 Accesses
7 Citations

Abstract

Dense General Matrix-Matrix (GEMM) multiplication is a core operation of the Basic Linear Algebra Subroutines (BLAS) library, and therefore, often resides at the bottom of the traditional software stack for many scientific applications. In fact, chip manufacturers give a special attention to the GEMM kernel implementation since this is exactly where most of the high-performance software libraries extract hardware performance. With the emergence of big data applications involving large data-sparse, hierarchically low-rank matrices, the off-diagonal tiles can be compressed to reduce the algorithmic complexity and the memory footprint. The resulting tile low-rank (TLR) data format is composed of small data structures, which retain the most significant information for each tile. However, to operate on low-rank tiles, a new GEMM operation and its corresponding API have to be designed on GPUs so that the data sparsity structure of the matrix can be exploited while leveraging the underlying TLR compression format. The main idea consists of aggregating all operations into a single kernel launch to compensate for their low arithmetic intensities and to mitigate the data transfer overhead on GPUs. The new TLR-GEMM kernel outperforms the cuBLAS dense batched GEMM by more than an order of magnitude and creates new opportunities for TLR advanced algorithms.

You have full access to this open access chapter, Download conference paper PDF

Towards efficient tile low-rank GEMM computation on sunway many-core processors

Article 15 October 2020

Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations

Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Keywords

1 Introduction

With the convergence of the third and fourth paradigms (i.e., simulation and big data), large-scale scientific applications, such as climate/weather forecasting [31], require a profound redesign to reduce the memory footprint as well as the overall algorithmic complexity. When considering multi-dimensional problems, with a large number of unknowns, n, the resulting covariance matrix may render its structure fully dense. To overcome the curse of dimensionality without violating the fidelity of the physical model, application developers rely on approximation methods, which drastically reduce the constraints on the memory footprint and the algorithmic complexity. For instance, hierarchical matrices or \(\mathscr {H}\)-matrices have been recently resurrected for high-performance computing [29, 32] as a potential algorithmic solution to tackle the aforementioned challenge. Because of their inherent recursive formulations, they are not amenable to massively parallel hardware systems such as GPUs.

We have designed, investigated, and implemented, on x86 shared-memory systems within the HiCMA library, an approximation method that exploits the natural data sparsity of the off-diagonal tiles while exposing parallelism to the fore [8]. Based on the tile low-rank (TLR) data format, the off-diagonal tiles of a dense covariance matrix are compressed up to a specific accuracy threshold, without compromising the model fidelity. The resulting data structure, much smaller than the original dense tiles, represents the new building blocks to pursue the matrix computations. Since main memory is a scarce resource on GPUs, TLR should enable solving even larger GPU-resident problems and eventually fall back to out-of-core algorithms. Although this TLR scenario may look utterly GPU friendly and more compliant, there are still some lingering performance bottlenecks. Indeed, decomposing an off-diagonal low-rank matrix problem into tasks may lead to a computational mismatch between the granularity of the task and the computational power of the underlying hardware. In particular, heavily multi-threaded accelerators such as NVIDIA GPUs need to maintain high occupancy and would require developers to move away from the current model, where tasks occupy all hardware computing elements, and, instead, simultaneously execute multiple smaller tasks, each spanning across a subset of hardware resources. This mode of operation, called batched execution [19], executes many smaller operations in parallel to make efficient use of the hardware and its instruction-level parallelism. To our knowledge, this work introduces the first TLR general matrix-matrix multiplication (TLR-GEMM) operating on data sparse matrix structures using GPU hardware accelerators. Our research contribution lies at the intersection of two concurrent and independent efforts happening in the scientific community: \(\mathscr {H}\)-matrix and batched operations for BLAS/LAPACK. Our TLR-GEMM leverages the current batched execution kernels in BLAS and LAPACK to support the matrix-matrix multiplication operation, which is perhaps one of the most important operations for high-performance numerical libraries, in TLR data format. In this paper, we focus on compressing data-sparse matrices and operating on them with uniform rank sub-blocks. Non-uniform rank compression and operation is a subject under investigation and beyond the scope of this paper. Our TLR-GEMM implementation is available in the open source KBLAS library maintained at https://github.com/ecrc/kblas-gpu.

The remainder of the paper is organized as follows: Sect. 2 presents related work and details our research contributions; Sect. 3 recalls the batched linear algebra community effort and gives a general overview of the hierarchical low-rank matrix approximation; Sect. 4 introduces the new TLR-GEMM and its variants; the implementation details of the various TLR-GEMM kernels are given in Sect. 5; Sect. 6 assesses the accuracy and performance of TLR-GEMM on the latest NVIDIA GPU hardware generation and compares it to the state-of-the-art high-performance batched dense GEMM, as implemented in [2]; we conclude in Sect. 7.

2 Related Work

The general matrix-matrix multiplication (GEMM) operation is the primitive kernel for a large spectrum of scientific applications and numerical libraries. GEMM has been optimized on various hardware vendors for large matrix sizes and constitutes the basic reference for Level-3 BLAS [18] operations and their usage in dense linear algebra algorithms. With the need to solve multicomponent partial differential equations, the resulting sparse matrix may have a dense block structure. The block size is relatively small and corresponds to the number of degrees of freedom per mesh element. The blocks are usually stored in a compressed block column/row data format. Matrix computations are then performed on these independent blocks by means of batched operations. For instance, batched dense matrix-vector multiplication is employed in sparse iterative solvers for reservoir simulations [6], while batched dense LAPACK factorizations [15] are required in sparse direct solvers for the Poisson equation using cyclic reduction [17]. Moreover, with the emergence of artificial intelligence, batched dense GEMM of even smaller sizes are needed in tensor contractions [4, 5, 33] and in deep learning frameworks [3, 27, 28]. To facilitate the adoption of all these efforts, a new standard has been proposed to homogenize the various batched API [19]. While the literature is rich in leveraging batched executions for dense and sparse linear algebra operations on x86 and hardware accelerators [4,5,6, 14, 15, 19], this trend has faced challenges and has not penetrated data-sparse applications yet, involving large hierarchically low-rank matrices (i.e., \(\mathscr {H}\)-matrices). In fact, there are three points to consider when designing batched operations for \(\mathscr {H}\)-matrices. First, there should be an efficient batched compression operation for \(\mathscr {H}\)-matrices on x86 and on hardware accelerators. Second, the inherent recursive formulation of \(\mathscr {H}\)-matrix resulting from nested dissections should be replaced, since it is not compliant with batched operations. Third, strong support is eventually required to handle batched matrix operations on the data-sparse compressed format, such as \(\mathscr {H}^2\), Hierarchically Semi-Separable representation (HSS), and the Hierarchical Off-Diagonal Low-Rank (HODLR) matrix. More recently, a block/tile low-rank (TLR) compressed format has been introduced on x86 [8, 11], which further exposes parallelism by flattening the recursion trees. The TLR data format may engender new opportunities and challenges for batched matrix compression and operation kernels on advanced SIMD architectures. Moreover, an effective implementation of the randomized SVD [26] for \(\mathscr {H}^2\) matrix compression has been ported to hardware accelerators [14]. The aforementioned three bottlenecks may now be relieved to deploy TLR matrix computations on GPUs.

3 Background

Batched Dense Linear Algebra. GPUs are massively parallel devices optimized for high SIMD throughput. Numerical kernels with low arithmetic intensity may still take advantage of the high memory bandwidth, provided they operate on large data structures to saturate the bus bandwidth. When operating on relatively small workloads on GPUs, the GPU overheads are twofold: (1) the overhead of moving the data from the CPU to the GPU memory through the thin PCIe pipe may be not worthwhile, and (2) the overhead of launching the kernels is not compensated by the low computation complexity. High-performance frameworks for batched operations [1, 2, 15] attempt to overcome both challenges by stitching together multiple operations occurring on independent data structures. This batched mode of execution increases hardware occupancy to attain higher sustained peak bandwidth while launching a single kernel to remove the kernel launch overheads all together. Figure 1(a) sketches batched operations of small dense GEMM operations \(C \; += \; A\times B\). Following the same community effort for the legacy BLAS, a community call for standardizing the batched API [19] has been initiated, gathering hardware vendors and researchers. This standardization effort enhances software development productivity, while the batched API gains maturity in the scientific community.

Hierarchical Low-Rank Matrix Computations. The hierarchically low-rank matrix, or \(\mathscr {H}\)-matrix [21, 23,24,25, 34] is a low-rank block approximation of a dense (sub)matrix, whose off-diagonal blocks may be each represented by an outer product of rectangular bases obtained from compression, e.g., via orthogonal transformations with singular value decomposition. An \(\mathscr {H}\)-matrix captures the most significant singular values and their corresponding singular vectors up to an application-dependent accuracy threshold for each block. Figure 1(b) highlights the structure of an \(\mathscr {H}\)-matrix resulting from a boundary element method. Each off-diagonal green block has been approximated to a similar rank up to 9. The red blocks are dense blocks and are mostly located around the diagonal structure of the matrix. This data sparse structure may be exposed by performing nested dissection after proper reordering and partitioning. This recursive formulation allows traversing low-rank off-diagonal blocks and compressing them using an adequate data storage format such as \(\mathscr {H}\) [29], Hierarchical Off-Diagonal Low-Rank (HODLR) [12], \(\mathscr {H}\) [22], Hierarchically Semi-Separable representation (HSS) [10, 32], and \(\mathscr {H}^2\) [13]. All these data compression formats belong to the family of \(\mathscr {H}\)-matrices and can be differentiated by the type of their bases i.e., nested (HSS and \(\mathscr {H}^2\)) or non-nested (\(\mathscr {H}\) and HODLR), in addition to the type of admissibility conditions, i.e., strong (\(\mathscr {H}\) and \(\mathscr {H}^2\)) or weak (HODLR and HSS). Each of these compression data formats exhibits different algorithmic complexities and memory footprint theoretical upper-bounds. The tile low-rank (TLR) [8, 11] data format is another case of the \(\mathscr {H}\) data format with non-nested bases and strong admissibility conditions. The matrix is logically split into tiles, similar to the tile algorithms from the PLASMA library [7]. The off-diagonal tiles may then be compressed using the rank-revealing QR once the whole dense matrix has been generated [11] or on-the-fly using the randomized singular value decomposition [26] while performing the matrix computations [8]. Figure 1(c) shows an illustration of such a TLR matrix composed of an \(8\,\times \,8\) logical tile. Owing to its simplicity, TLR permits flattening the inherent recursive formulation. Although TLR may not provide such optimal theoretical bounds as for the nested-basis data formats, regarding algorithmic complexities and memory footprint, it is very amenable to advanced performance implementations and optimizations.

The main objective of this paper is to consolidate the three messages conveyed by the sketches of Fig. 1, i.e., batched operations (including matrix compression and computations), \(\mathscr {H}\)-matrix applications and TLR data format. This consolidation is the crux of the TLR-GEMM implementation on GPUs.

4 Design of Tile Low-Rank GEMM Kernels

This section describes the design of the TLR-GEMM kernel and identifies its variants using a bottom-up approach: from the single GEMM kernel operating on low-rank data format to the corresponding batched operations, and then all the way up to the actual TLR-GEMM driver.

Low-Rank Data Format. Low-rank approximation consists of compressing a dense matrix X of dimensions m-rows and n-columns and representing it as the product of two tall and skinny matrices, such that \(X=X_u \times X_v^T\), with \(X_u\) and \(X_v\) of dimensions (m-rows, k-columns) and (n-rows, k-columns), respectively, and k the rank of X. The choice for the compression algorithms is typically Rank-Revealing QR (RRQR), adaptive cross-approximation (ACA), or (randomized) singular value decomposition (SVD), etc. The randomized Jacobi-based SVD [26] maps well on SIMD architectures because it does not involve pivoting nor element sweeping, as in the RRQR and ACA methods, respectively. It is perhaps the most optimized compression algorithm on GPUs, as implemented in [14].

Low-Rank GEMM Variants. We identify four possible variants, based on the input data format of the involved matrices A, B, and C. We use the following notation: \(\mathtt {GEMM-}T_AT_BT_C\) is the GEMM kernel for a given input type (Dense or Low-rank) for the matrices A, B, and C. The variants are as follows: (1) either of A or B in low-rank format, C in dense format: GEMM-DLD, or GEMM-LDD, (2) either of A or B in low-rank format, C in low-rank format: GEMM-DLL, or GEMM-LDL, (3) A and B in low-rank format, C in dense format: GEMM-LLD, as illustrated in Fig. 2(a), and, (4) A, B and C in low-rank format: GEMM-LLL, as illustrated in Fig. 2(b). We focus solely on the last two variants in this paper, i.e., GEMM-LLD and GEMM-LLL, since they are the bases for supporting the Schur complement calculation and for matrix factorization, in the context of sparse direct solvers [11] and data-sparse matrix solvers [8], respectively. In fact, the other possible variants, i.e., when both A and B are in dense format and C in dense (GEMM-DDD) or in low-rank format (GEMM-DDL), are not considered because the first is the actual legacy GEMM operation, and the second is more expensive than regular GEMM in terms of flops.

Batched Low-Rank GEMM. We can then derive the batched low-rank GEMM routines from their corresponding single low-rank GEMM routines. The new batched low-rank GEMM kernels are now defined as single kernels, i.e., Batched-GEMM-LLD and Batched-GEMM-LLL, which simultaneously execute independent GEMM-LLD and GEMM-LLL operations, as demonstrated in Fig. 2(c) and (d), respectively. This batched kernel is used as a building block for the main driver performing the TLR-GEMM on large TLR matrices, as described in the following paragraph.

TLR-GEMM (driver). In the driver of the TLR-GEMM operation, the data-sparse matrices A, B, and C are subdivided into a grid of tiles, where each tile may individually be compressed into low-rank data form, as illustrated in Fig. 3. Indeed, Fig. 3(a) and (b) represent the TLR-GEMM operation, when \(T_C\) is tile dense or tile low-rank, respectively. In a standard GEMM operation, each tile of the matrix C is updated by an inner-product composed of a sequence of pair-wise GEMM operations of its corresponding row of tiles from matrix A and column of tiles from matrix B. However, when dealing with TLR data format, since the workload of each low-rank tile is too small to saturate a modern GPU with sufficient work, concurrent processing of these independent low-rank tiles inner-products is necessary to increase the GPU occupancy. To overcome this challenge, we need to process these inner-product low-rank GEMM calls in a batched mode. However, available batched GEMM routines assume the batched operation is a primitive CUDA kernel rather than a sequence of calls; thus, we re-formulate the set of inner-products into a set of successive outer-products by means of loop re-ordering. Each outer-product is a call to batched GEMM-LLD or GEMM-LLL routine which updates all tiles of matrix C in parallel. This process is repeated nt times, nt being the number of tiles in a row of matrix A or a column of matrix B, as illustrated in Fig. 3 for both variants of TLR-GEMM.

5 Implementation Details

Update Dense C: GEMM-LLD. This operation is performed as a sequence of three small GEMM calls. Assuming matrices C and A are of m-rows, C and B of n-columns, A and B of k-columns and k-rows respectively, and A and B are of ranks \(r_a\), and \(r_b\), respectively, the operation \(C=\alpha A \times B + \beta C\) is equivalent to \(C=\alpha A_u \times A_v^T \times B_u \times B_v^T + \beta C\).

Update Low-Rank C: GEMM-LLL. We describe the second variant of the low-rank GEMM operation when updating C in low-rank format, as outlined in the corresponding Algorithm 1. In fact, this algorithm corresponds to the randomized SVD, as described in [26]. We assume the non-transpose case for both matrices A and B. The matrix-matrix multiplication \(C=\alpha A \times B + \beta C\) involves two sub-stages, where matrices A, B, and C are represented by their low-rank format \((A_u,A_v), (B_u,B_v)\), and \((C_u,C_v)\), respectively. The first stage consists of the multiplication of low-rank matrices A and B, as shown in steps \(2-3\) of Algorithm 1. The second stage highlights the final addition of the intermediate matrix with the low-rank matrix C, as demonstrated in steps \(4-6\). This second stage produces low-rank \(\dot{C}=\dot{C_u} \times \dot{C_v}\) with bloated rank \(\dot{r_c}\). As such, low-rank matrix addition, as described by Grasedyck [20], requires a process of recompression based on QR factorization to restore a minimal rank for the product matrix as well as the orthogonality of its components. This recompression is achieved by reforming the product \(\dot{C_u} \times \dot{C_v}\) in terms of its SVD representation, i.e., its singular values and their corresponding right and left singular vectors. By factorizing \(\dot{C_u}= Q_u \times R_u\), and \(\dot{C_v}= Q_v \times R_v\), we can then represent \(\dot{C}=Q_u \times R_u \times (Q_v \times R_v)^T = Q_u \times (R_u \times R_v^T) \times Q_v^T\), as the SVD of \(\dot{C}\). Recompressing the result of the tiny product \(R_u \times R_v^T\) using SVD or ACA, enables restoration of the rank of \(\dot{C}\) to a minimum value based on a predetermined fixed accuracy threshold or fixed rank truncation. This process of re-compression is described in steps 7–18 of Algorithm 1. The implementation of this variant leverages the randomized SVD on GPUs from [14], in the context of matrix compression for \(H^2\) data format, to the TLR data format.

Batched Low-Rank GEMM. For batching the two GEMM-LLD and GEMM-LLL variants, the challenges are quite different. Batched GEMM-LLD is straightforward to implement on GPUs (and even on x86), due to existing fast batched GEMM implementations on small sizes [2, 27]. The task is far more complex for GEMM-LLL, since the recompression involves numerical kernels (GEQRF, ORGQR and GESVD), which are not as regular as standard GEMMs, e.g., in terms of memory accesses. The support from vendor numerical libraries for batched versions of these routines is limited with poorly performing or simply inexistent implementations. We have further leveraged the batched GEQRF and ORGQR from [14] and integrated this into the batched GEMM-LLL. For the batched GESVD on the tiny \(k\times k\) matrix, there are two options. The first is again based on the randomized SVD itself, while the second uses a novel ACA implementation on GPUs. Although ACA may require an expensive element sweeping procedure, this overhead is mitigated by the small matrix size. The resulting algorithm for batched low-rank is very similar to Algorithm 1, except that each call is now performed in a batched mode of execution.

TLR-GEMM (driver). Putting all previous standard and batched kernels together, we present TLR-GEMM on GPUs. We leverage the batched low-rank GEMM and operate on TLR matrices. This modular approach allows assessing the performance of each component, while enhancing software development productivity. The algorithm for TLR-GEMM driver consists of a single loop of nt successive outer-products, each corresponding to a batched GEMM-LLD or GEMM-LLL call, as depicted in Fig. 3. Compared to a GEMM operation on matrices with non-TLR data formats (involving recursion and tree traversals), TLR proves to be a simple yet effective approach, especially when considering hardware accelerators. For the algorithmic complexity of each variant, it is obvious that TLR-GEMM based on GEMM-LLL is more expensive than the one based on GEMM-LLD, because of the recompression stage.

6 Experimental Results

The benchmarking system is a two-socket 20-core Intel Broadwell running at 2.20 GHz with 512 GB of main memory, equipped with an NVIDIA GPU Volta V100 with 16 GB of main memory and PCIe 16x. We use a data-sparse matrix kernel (i.e., Hilbert) with singular values following an exponential decay. In fact, such decay in singular values is frequently observed in many matrix kernels in covariance-based scientific applications, such as climate/weather forecasting simulations [9]. All calculations are performed in double precision arithmetics. The reported performance numbers are compared to cuBLAS batched dense GEMM. Figure 4 illustrates the singular value distribution and the numerical accuracy assessment. The singular values of the Hilbert matrix kernel exponentially decay, as seen in Fig. 4(a). Approximately the first 30 are the most significant, while the remainder are close to machine precision and can be safely ignored. Figure 4(b) demonstrates the numerical robustness of the single GEMM-LLD and GEMM-LLL kernel variants using the same Hilbert matrix operator. GEMM-LLD approaches expected accuracy for rank smaller than GEMM-LLL, due to the rounding errors introduced by the additional floating-point operations from the recompression stage. Otherwise, both variants show correctness when truncating at ranks close to the accuracy threshold shown in Fig. 4(a).

Figure 5 highlights the speedups of batched GEMM-LLD and GEMM-LLL, against cuBLAS batch dense GEMM considering various ranks and a fixed batch size of 1000. Figure 5(a) and (b) illustrate the speedups of batched GEMM-LLD with and without compression overhead, respectively. Similarly, Fig. 5(c) and (d) illustrate the speedups of batched GEMM-LLL with and without compression overhead, respectively. Obviously, compression may turn out to be an expensive operation, which may slow down the the performance of batched GEMM-LLD (Fig. 5(a)) and batched GEMM-LLL (Fig. 5(c)); however, this overhead is usually occurring once, since the compressed form of the corresponding matrices may be used repeatedly (see TLR-GEMM in Fig. 5(b) and (d)). The speedups recorded for batched GEMM-LLD are higher than those for GEMM-LLL, when comparing to the cuBLAS batch dense GEMM, because of the recompression step. While speedups are obtained for all ranks for batched GEMM-LLD (Fig. 5(b)), batched GEMM-LLL (Fig. 5(d)) records speedups only for relatively small rank sizes. Although the Hilbert matrix kernel has an exponential singular value decay, we also assess performance for larger ranks. These extra flops, although unnecessary, allow stretching of the batched kernels and observing when the crossover point occurs. For instance, in Fig. 5(b), the batched GEMM-LLD with rank 128 runs out of memory, due to the dense storage of the matrix C, while still outperforming the cuBLAS batch dense GEMM. In Fig. 5(d), the batched GEMM-LLL with rank 128 runs out of memory, due to the temporary memory space required by the recompression stage, while not being able to outperform the cuBLAS batch dense GEMM.

Figure 6 shows the speedups for batched GEMM-LLD and GEMM-LLL, with varying batch count, against cuBLAS batch dense GEMM, when using the ranks at which numerical accuracy is reached from Fig. 4(b), i.e., 16 and 32, respectively. The performance speedup increase as the batch count rises reveals how the device becomes overwhelmed due to high occupancy.

Figure 7 presents the elapsed time of TLR-GEMM based on batched GEMM-LLD, named TLR-GEMM-LLD (Fig. 7(a) and (b)), and based on batched GEMM-LLL, named TLR-GEMM-LLL (Fig. 7(c) and (d)), considering various ranks, against cuBLAS dense GEMM. TLR-GEMM-LLD (solid line plots) outperforms cuBLAS dense GEMM (double line plot) by more than an order of magnitude when A and B are already compressed, as shown in Fig. 7(a). When the matrices A and B are not compressed, the performance speedup slightly drops to eightfold, as seen in Fig. 7(b). Indeed, the expensive compression of matrices A and B is only performed once, followed by successive outer-products, in the form of batched GEMM-LLD calls. This allows to mitigate the compression overhead, discussed earlier in the section. TLR-GEMM-LLL (solid line plots) outperforms cuBLAS dense GEMM (double line plot) by more than an order of magnitude when A and B are already compressed, as shown in Fig. 7(c). When the matrices A and B are not compressed, the performance speedup remains almost the same, since the (re)compression is the most time consuming part of the batched GEMM-LLL operations (Fig. 7(d)).

Figure 8 highlights the performance enhancements when using the Hilbert matrix kernel to perform TLR-GEMM with appropriate ranks for GEMM-LLD and GEMM-LLL, 16 and 32, respectively. Although the number of floating-point operations varies, the objective is to achieve the expected numerical accuracy. TLR-GEMM-LLD and TLR-GEMM-LLL kernels (solid line plots) score a speedup of more than an order of magnitude and fourfold, respectively, against cuBLAS dense GEMM (double line plot).

7 Conclusions and Future Work

This paper presents a novel batched tile low-rank (TLR) GEMM kernel on GPUs, which is a core operation of large-scale data sparse applications. Results demonstrate the numerical robustness and manyfold performance speedups against cuBLAS batched dense GEMM on the latest NVIDIA V100 GPU generation. This work represents a pathfinder toward enabling advanced hierarchical matrix computations on GPUs. Moreover, owing to its simplicity and modularity, the TLR data format may facilitate the port to multiple GPUs of batched low-rank matrix operations. Future work includes supporting non-uniform ranks for compression and operations to further reduce the memory footprint and flop count cost, in addition to supporting the other BLAS routines. We would like also to integrate the TLR compression and the TLR-GEMM operation in the Multi-Object Adaptive Optics application [30] in the context of computational astronomy and assess its real-time performance impact.

References

Matrix Algebra on GPU and Multicore Architectures. Innovative Computing Laboratory, University of Tennessee. http://icl.cs.utk.edu/magma/
The NVIDIA CUDA Basic Linear Algebra Subroutines (CUBLAS). http://developer.nvidia.com/cublas
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
Abdelfattah, A., et al.: High-performance tensor contractions for GPUs. Procedia Comput. Sci. 80, 108–118 (2016). International Conference on Computational Science 2016, ICCS 2016, San Diego, California, USA, 6–8 June 2016
Article Google Scholar
Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Performance, design, and autotuning of batched GEMM for GPUs. In: Kunkel, J.M., Balaji, P., Dongarra, J. (eds.) ISC High Performance 2016. LNCS, vol. 9697, pp. 21–38. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41321-1_2
Chapter Google Scholar
Abdelfattah, A., Ltaief, H., Keyes, D.E., Dongarra, J.J.: Performance optimization of sparse matrix-vector multiplication for multi-component PDE-based applications using GPUs. Concurr. Comput.: Pract. Exp. 28(12), 3447–3465 (2016)
Article Google Scholar
Agullo, E., et al.: Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J. Phys: Conf. Ser. 180(1), 012037 (2009)
Google Scholar
Akbudak, K., Ltaief, H., Mikhalev, A., Keyes, D.: Tile low rank cholesky factorization for climate/weather modeling applications on manycore architectures. In: Kunkel, J.M., Yokota, R., Balaji, P., Keyes, D. (eds.) ISC 2017. LNCS, vol. 10266, pp. 22–40. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58667-0_2
Chapter Google Scholar
Akbudak, K., Ltaief, H., Mikhalev, A., Charara, A., Keyes, D.: Exploiting data sparsity for large-scale matrix computations. In: Aldinucci, M., et al. (eds.) Euro-Par 2018. LNCS, vol. 11014, pp. xx–yy. Springer, Cham (2018)
Chapter Google Scholar
Ambikasaran, S., Darve, E.: An \(\mathscr {O}({N} \log {N})\) fast direct solver for partial hierarchically semiseparable matrices. J. Sci. Comput. 57(3), 477–501 (2013)
Article MathSciNet Google Scholar
Amestoy, P.R., Ashcraft, C., Boiteau, O., Buttari, A., L’Excellent, J.Y., Weisbecker, C.: Improving multifrontal methods by means of block low-rank representations. SIAM J. Sci. Comput. 37(3), A1451–A1474 (2015). https://doi.org/10.1137/120903476
Article MathSciNet MATH Google Scholar
Aminfar, A., Darve, E.: A fast sparse solver for Finite-Element matrices. arXiv:1403.5337 [cs.NA], pp. 1–25 (2014)
Börm, S.: Efficient numerical methods for non-local operators: \(\mathscr {H}^2\)-Matrix compression, algorithms and analysis. EMS Tracts in Mathematics, vol. 14. European Mathematical Society (2010)
Google Scholar
Boukaram, W.H., Turkiyyah, G., Ltaief, H., Keyes, D.E.: Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression. Parallel Comput. 74, 19–33 (2017)
Article MathSciNet Google Scholar
Charara, A., Keyes, D., Ltaief, H.: Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs. ACM Trans. Math. Softw. (2017, submitted). (under review, http://hdl.handle.net/10754/622975)
Charara, A., Keyes, D., Ltaief, H.: Software artifact for Euro-Par 2018: Tile Low-Rank GEMM Using Batched Operations on GPUs. figshare. Code. (2018). https://doi.org/10.6084/m9.figshare.6387623
Chávez, G., Turkiyyah, G., Zampini, S., Ltaief, H., Keyes, D.: Accelerated cyclic reduction: a distributed-memory fast solver for structured linear systems. Parallel Comput. 74, 65–83 (2017)
Article MathSciNet Google Scholar
Dongarra, J., Du Croz, J., Hammarling, S., Hanson, R.J.: An extended set of Fortran basic linear algebra subprograms. ACM Trans. Math. Softw. 14, 1–17 (1988)
Article Google Scholar
Dongarra, J., et al.: A proposed API for batched basic linear algebra subprograms. Mims preprint, University of Manchester (2016). http://eprints.maths.manchester.ac.uk/id/eprint/2464
Grasedyck, L., Hackbusch, W.: Construction and arithmetics of \(\mathscr {H}\)-matrices. Computing 70(4), 295–334 (2003). https://doi.org/10.1007/s00607-003-0019-1
Article MathSciNet MATH Google Scholar
Hackbusch, W.: A sparse matrix arithmetic based on \(\mathscr {H}\)-matrices. part i: introduction to \(\mathscr {H}\)-matrices. Computing 62(2), 89–108 (1999). https://doi.org/10.1007/s006070050015
Article MathSciNet MATH Google Scholar
Hackbusch, W.: Hierarchical Matrices: Algorithms and Analysis. Springer Series in Computational Mathematics, vol. 49. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-47324-5
Book MATH Google Scholar
Hackbusch, W., Börm, S.: Data-sparse approximation by adaptive \({\mathscr {H}}^2\)-matrices. Computing 69(1), 1–35 (2002). https://doi.org/10.1007/s00607-002-1450-4
Article MathSciNet MATH Google Scholar
Hackbusch, W., Börm, S., Grasedyck, L.: HLib 1.4. Max-Planck-Institut, Leipzig (2012)
Google Scholar
Hackbusch, W., Khoromskij, B., Sauter, S.: On \(\mathscr {H}^2\)-matrices. In: Bungartz, H.J., Hoppe, R.H.W., Zenger, C. (eds.) Lectures on Applied Mathematics, pp. 9–29. Springer, Heidelberg (2000). https://doi.org/10.1007/978-3-642-59709-1_2
Chapter MATH Google Scholar
Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011). https://doi.org/10.1137/090771806
Article MathSciNet MATH Google Scholar
Heinecke, A., Henry, G., Hutchinson, M., Pabst, H.: LIBXSMM: accelerating small matrix multiplications by runtime code generation. In: 0001, J.W., Pancake, C.M. (eds.) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, 13–18 November 2016, p. 84. ACM (2016)
Google Scholar
Kim, K., et al.: Designing vector-friendly compact BLAS and LAPACK kernels. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017, pp. 55:1–55:12. ACM, New York (2017). https://doi.org/10.1145/3126908.3126941
Kriemann, R.: LU factorization on many-core systems. Comput. Vis. Sci. 16(3), 105–117 (2013). https://doi.org/10.1007/s00791-014-0226-7
Article MathSciNet MATH Google Scholar
Ltaief, H., et al.: Real-time massively distributed multi-object adaptive optics simulations for the european extremely large telescope. In: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), accepted, May 2018
Google Scholar
North, G.R., Wang, J., Genton, M.G.: Correlation models for temperature fields. J. Clim. 24, 5850–5862 (2011)
Article Google Scholar
Rouet, F.H., Li, X.S., Ghysels, P., Napov, A.: A distributed-memory package for dense hierarchically semi-separable matrix computations using randomization. ACM Trans. Math. Softw. 42(4), 27:1–27:35 (2016)
Article MathSciNet Google Scholar
Shi, Y., Niranjan, U.N., Anandkumar, A., Cecka, C.: Tensor contractions with extended BLAS kernels on CPU and GPU. In: HiPC, pp. 193–202. IEEE Computer Society (2016)
Google Scholar
Tyrtyshnikov, E.: Mosaic-skeleton approximations. Calcolo 33(1), 47–57 (1996)
Article MathSciNet Google Scholar

Download references

Data Availability Statement and Acknowledgments

The datasets and code generated during and/or analysed during the current study are available in the figshare repository: https://doi.org/10.6084/m9.figshare.6387623 [16]. We would like to acknowledge Paris Observatory (LESIA, France) for giving us remote access to their Volta-based system, sponsored through a grant from project #671662 (Green Flash), funded by European Commission under program H2020-EU.1.2.2 coordinated in H2020-FETHPC-2014.

Author information

Authors and Affiliations

Extreme Computing Research Center, Division of Computer, Electrical, and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal Jeddah, 23955, Kingdom of Saudi Arabia
Ali Charara, David Keyes & Hatem Ltaief

Authors

Ali Charara
View author publications
You can also search for this author in PubMed Google Scholar
David Keyes
View author publications
You can also search for this author in PubMed Google Scholar
Hatem Ltaief
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ali Charara .

Editor information

Editors and Affiliations

Department of Computer Science, University of Torino, Torino, Italy
Marco Aldinucci
Department of Computer Science, University of Torino, Torino, Italy
Luca Padovani
Department of Computer Science, University of Pisa, Pisa, Italy
Massimo Torquati

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Charara, A., Keyes, D., Ltaief, H. (2018). Tile Low-Rank GEMM Using Batched Operations on GPUs. In: Aldinucci, M., Padovani, L., Torquati, M. (eds) Euro-Par 2018: Parallel Processing. Euro-Par 2018. Lecture Notes in Computer Science(), vol 11014. Springer, Cham. https://doi.org/10.1007/978-3-319-96983-1_57

Download citation

DOI: https://doi.org/10.1007/978-3-319-96983-1_57
Published: 01 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96982-4
Online ISBN: 978-3-319-96983-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Tile Low-Rank GEMM Using Batched Operations on GPUs

Abstract

Similar content being viewed by others

Towards efficient tile low-rank GEMM computation on sunway many-core processors

Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations

Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Keywords

1 Introduction

2 Related Work

3 Background

4 Design of Tile Low-Rank GEMM Kernels

5 Implementation Details

6 Experimental Results

7 Conclusions and Future Work

References

Data Availability Statement and Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Tile Low-Rank GEMM Using Batched Operations on GPUs

Abstract

Similar content being viewed by others

Towards efficient tile low-rank GEMM computation on sunway many-core processors

Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations

Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Keywords

1 Introduction

2 Related Work

3 Background

4 Design of Tile Low-Rank GEMM Kernels

5 Implementation Details

6 Experimental Results

7 Conclusions and Future Work

References

Data Availability Statement and Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation